Problem Statement¶

Context¶

AllLife Bank is a US bank that has a growing customer base. The majority of these customers are liability customers (depositors) with varying sizes of deposits. The number of customers who are also borrowers (asset customers) is quite small, and the bank is interested in expanding this base rapidly to bring in more loan business and in the process, earn more through the interest on loans. In particular, the management wants to explore ways of converting its liability customers to personal loan customers (while retaining them as depositors).

A campaign that the bank ran last year for liability customers showed a healthy conversion rate of over 9% success. This has encouraged the retail marketing department to devise campaigns with better target marketing to increase the success ratio.

You as a Data scientist at AllLife bank have to build a model that will help the marketing department to identify the potential customers who have a higher probability of purchasing the loan.

Objective¶

To predict whether a liability customer will buy personal loans, to understand which customer attributes are most significant in driving purchases, and identify which segment of customers to target more.

Data Dictionary¶

  • ID: Customer ID
  • Age: Customer’s age in completed years
  • Experience: #years of professional experience
  • Income: Annual income of the customer (in thousand dollars)
  • ZIP Code: Home Address ZIP code.
  • Family: the Family size of the customer
  • CCAvg: Average spending on credit cards per month (in thousand dollars)
  • Education: Education Level. 1: Undergrad; 2: Graduate;3: Advanced/Professional
  • Mortgage: Value of house mortgage if any. (in thousand dollars)
  • Personal_Loan: Did this customer accept the personal loan offered in the last campaign? (0: No, 1: Yes)
  • Securities_Account: Does the customer have securities account with the bank? (0: No, 1: Yes)
  • CD_Account: Does the customer have a certificate of deposit (CD) account with the bank? (0: No, 1: Yes)
  • Online: Do customers use internet banking facilities? (0: No, 1: Yes)
  • CreditCard: Does the customer use a credit card issued by any other Bank (excluding All life Bank)? (0: No, 1: Yes)

Importing necessary libraries¶

In [8]:
!jupyter nbconvert --to html "/content/drive/MyDrive/Colab Notebooks/Final copy of project2 - Linear Regression Logistic Regression and Decision Tree - Jerlin.ipynb"
This application is used to convert notebook files (*.ipynb)
        to various other formats.

        WARNING: THE COMMANDLINE INTERFACE MAY CHANGE IN FUTURE RELEASES.

Options
=======
The options below are convenience aliases to configurable class-options,
as listed in the "Equivalent to" description-line of the aliases.
To see all configurable class-options for some <cmd>, use:
    <cmd> --help-all

--debug
    set log level to logging.DEBUG (maximize logging output)
    Equivalent to: [--Application.log_level=10]
--show-config
    Show the application's configuration (human-readable format)
    Equivalent to: [--Application.show_config=True]
--show-config-json
    Show the application's configuration (json format)
    Equivalent to: [--Application.show_config_json=True]
--generate-config
    generate default config file
    Equivalent to: [--JupyterApp.generate_config=True]
-y
    Answer yes to any questions instead of prompting.
    Equivalent to: [--JupyterApp.answer_yes=True]
--execute
    Execute the notebook prior to export.
    Equivalent to: [--ExecutePreprocessor.enabled=True]
--allow-errors
    Continue notebook execution even if one of the cells throws an error and include the error message in the cell output (the default behaviour is to abort conversion). This flag is only relevant if '--execute' was specified, too.
    Equivalent to: [--ExecutePreprocessor.allow_errors=True]
--stdin
    read a single notebook file from stdin. Write the resulting notebook with default basename 'notebook.*'
    Equivalent to: [--NbConvertApp.from_stdin=True]
--stdout
    Write notebook output to stdout instead of files.
    Equivalent to: [--NbConvertApp.writer_class=StdoutWriter]
--inplace
    Run nbconvert in place, overwriting the existing notebook (only
            relevant when converting to notebook format)
    Equivalent to: [--NbConvertApp.use_output_suffix=False --NbConvertApp.export_format=notebook --FilesWriter.build_directory=]
--clear-output
    Clear output of current file and save in place,
            overwriting the existing notebook.
    Equivalent to: [--NbConvertApp.use_output_suffix=False --NbConvertApp.export_format=notebook --FilesWriter.build_directory= --ClearOutputPreprocessor.enabled=True]
--coalesce-streams
    Coalesce consecutive stdout and stderr outputs into one stream (within each cell).
    Equivalent to: [--NbConvertApp.use_output_suffix=False --NbConvertApp.export_format=notebook --FilesWriter.build_directory= --CoalesceStreamsPreprocessor.enabled=True]
--no-prompt
    Exclude input and output prompts from converted document.
    Equivalent to: [--TemplateExporter.exclude_input_prompt=True --TemplateExporter.exclude_output_prompt=True]
--no-input
    Exclude input cells and output prompts from converted document.
            This mode is ideal for generating code-free reports.
    Equivalent to: [--TemplateExporter.exclude_output_prompt=True --TemplateExporter.exclude_input=True --TemplateExporter.exclude_input_prompt=True]
--allow-chromium-download
    Whether to allow downloading chromium if no suitable version is found on the system.
    Equivalent to: [--WebPDFExporter.allow_chromium_download=True]
--disable-chromium-sandbox
    Disable chromium security sandbox when converting to PDF..
    Equivalent to: [--WebPDFExporter.disable_sandbox=True]
--show-input
    Shows code input. This flag is only useful for dejavu users.
    Equivalent to: [--TemplateExporter.exclude_input=False]
--embed-images
    Embed the images as base64 dataurls in the output. This flag is only useful for the HTML/WebPDF/Slides exports.
    Equivalent to: [--HTMLExporter.embed_images=True]
--sanitize-html
    Whether the HTML in Markdown cells and cell outputs should be sanitized..
    Equivalent to: [--HTMLExporter.sanitize_html=True]
--log-level=<Enum>
    Set the log level by value or name.
    Choices: any of [0, 10, 20, 30, 40, 50, 'DEBUG', 'INFO', 'WARN', 'ERROR', 'CRITICAL']
    Default: 30
    Equivalent to: [--Application.log_level]
--config=<Unicode>
    Full path of a config file.
    Default: ''
    Equivalent to: [--JupyterApp.config_file]
--to=<Unicode>
    The export format to be used, either one of the built-in formats
            ['asciidoc', 'custom', 'html', 'latex', 'markdown', 'notebook', 'pdf', 'python', 'qtpdf', 'qtpng', 'rst', 'script', 'slides', 'webpdf']
            or a dotted object name that represents the import path for an
            ``Exporter`` class
    Default: ''
    Equivalent to: [--NbConvertApp.export_format]
--template=<Unicode>
    Name of the template to use
    Default: ''
    Equivalent to: [--TemplateExporter.template_name]
--template-file=<Unicode>
    Name of the template file to use
    Default: None
    Equivalent to: [--TemplateExporter.template_file]
--theme=<Unicode>
    Template specific theme(e.g. the name of a JupyterLab CSS theme distributed
    as prebuilt extension for the lab template)
    Default: 'light'
    Equivalent to: [--HTMLExporter.theme]
--sanitize_html=<Bool>
    Whether the HTML in Markdown cells and cell outputs should be sanitized.This
    should be set to True by nbviewer or similar tools.
    Default: False
    Equivalent to: [--HTMLExporter.sanitize_html]
--writer=<DottedObjectName>
    Writer class used to write the
                                        results of the conversion
    Default: 'FilesWriter'
    Equivalent to: [--NbConvertApp.writer_class]
--post=<DottedOrNone>
    PostProcessor class used to write the
                                        results of the conversion
    Default: ''
    Equivalent to: [--NbConvertApp.postprocessor_class]
--output=<Unicode>
    Overwrite base name use for output files.
                Supports pattern replacements '{notebook_name}'.
    Default: '{notebook_name}'
    Equivalent to: [--NbConvertApp.output_base]
--output-dir=<Unicode>
    Directory to write output(s) to. Defaults
                                  to output to the directory of each notebook. To recover
                                  previous default behaviour (outputting to the current
                                  working directory) use . as the flag value.
    Default: ''
    Equivalent to: [--FilesWriter.build_directory]
--reveal-prefix=<Unicode>
    The URL prefix for reveal.js (version 3.x).
            This defaults to the reveal CDN, but can be any url pointing to a copy
            of reveal.js.
            For speaker notes to work, this must be a relative path to a local
            copy of reveal.js: e.g., "reveal.js".
            If a relative path is given, it must be a subdirectory of the
            current directory (from which the server is run).
            See the usage documentation
            (https://nbconvert.readthedocs.io/en/latest/usage.html#reveal-js-html-slideshow)
            for more details.
    Default: ''
    Equivalent to: [--SlidesExporter.reveal_url_prefix]
--nbformat=<Enum>
    The nbformat version to write.
            Use this to downgrade notebooks.
    Choices: any of [1, 2, 3, 4]
    Default: 4
    Equivalent to: [--NotebookExporter.nbformat_version]

Examples
--------

    The simplest way to use nbconvert is

            > jupyter nbconvert mynotebook.ipynb --to html

            Options include ['asciidoc', 'custom', 'html', 'latex', 'markdown', 'notebook', 'pdf', 'python', 'qtpdf', 'qtpng', 'rst', 'script', 'slides', 'webpdf'].

            > jupyter nbconvert --to latex mynotebook.ipynb

            Both HTML and LaTeX support multiple output templates. LaTeX includes
            'base', 'article' and 'report'.  HTML includes 'basic', 'lab' and
            'classic'. You can specify the flavor of the format used.

            > jupyter nbconvert --to html --template lab mynotebook.ipynb

            You can also pipe the output to stdout, rather than a file

            > jupyter nbconvert mynotebook.ipynb --stdout

            PDF is generated via latex

            > jupyter nbconvert mynotebook.ipynb --to pdf

            You can get (and serve) a Reveal.js-powered slideshow

            > jupyter nbconvert myslides.ipynb --to slides --post serve

            Multiple notebooks can be given at the command line in a couple of
            different ways:

            > jupyter nbconvert notebook*.ipynb
            > jupyter nbconvert notebook1.ipynb notebook2.ipynb

            or you can specify the notebooks list in a config file, containing::

                c.NbConvertApp.notebooks = ["my_notebook.ipynb"]

            > jupyter nbconvert --config mycfg.py

To see all available configurables, use `--help-all`.

[NbConvertApp] WARNING | pattern '/content/drive/MyDrive/Colab Notebooks/Final copy of project2 - Linear Regression Logistic Regression and Decision Tree - Jerlin.ipynb' matched no files
In [9]:
# Installing the libraries with the specified version.
!pip install numpy pandas matplotlib seaborn scikit-learn sklearn-pandas -q --user
[notice] A new release of pip is available: 24.0 -> 24.2
[notice] To update, run: C:\Users\paluv\AppData\Local\Microsoft\WindowsApps\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\python.exe -m pip install --upgrade pip
In [10]:
import sys
print(sys.version)
3.11.9 (tags/v3.11.9:de54cf5, Apr  2 2024, 10:12:12) [MSC v.1938 64 bit (AMD64)]

Note:

  1. After running the above cell, kindly restart the notebook kernel (for Jupyter Notebook) or runtime (for Google Colab), write the relevant code for the project from the next cell, and run all cells sequentially from the next cell.

  2. On executing the above line of code, you might see a warning regarding package dependencies. This error message can be ignored as the above code ensures that all necessary libraries and their dependencies are maintained to successfully execute the code in this notebook.

Loading the dataset¶

In [11]:
# Libraries to help with reading and manipulating data
import pandas as pd
import numpy as np

# libaries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Library to split data
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Library to build model for prediction
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree

# Library to get different metric scores
from sklearn.metrics import (
    f1_score,
    accuracy_score,
    recall_score,
    precision_score,
    confusion_matrix,
)

# Library to build linear model for statistical analysis and prediction
from sklearn.linear_model import LinearRegression

#To build linear model for satistical analysis and prediction
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline


#library required for structured python code
from sklearn_pandas import DataFrameMapper

# To ignore warnings
import warnings
warnings.filterwarnings("ignore")

%matplotlib inline
sns.set_style('darkgrid')
sns.set(color_codes=True)

#Remove the limit for number of displayed columns and rows
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
In [12]:
# library required for structured Python code
from sklearn_pandas import DataFrameMapper
In [13]:
#Library to split the data
from sklearn.model_selection import train_test_split

#Library to build model for prediction
from sklearn.tree import DecisionTreeClassifier

#Library to build decision tree models
from sklearn import tree

#Library to build Logistic regression model
from sklearn.linear_model import LogisticRegression

#Library to get different metric scores
from sklearn.metrics import (
    f1_score,
    accuracy_score,
    recall_score,
    precision_score,
    confusion_matrix,
)

#Library to build linear model for statistical analysis and prediction
from sklearn.linear_model import LinearRegression

Data Overview¶

  • Observations
  • Sanity checks
In [14]:
# from google.colab import drive
# drive.mount('/content/drive')
In [15]:
# df = pd.read_csv('//content/drive/MyDrive/Project2/Loan_Modelling.csv')
df = pd.read_csv('Loan_Modelling.csv')
In [16]:
#creating a copy of the dataset
df_copy = df.copy()

Data Structure Analysis¶

In [17]:
#Viewing top observations from the dataset
df.head()
Out[17]:
ID Age Experience Income ZIPCode Family CCAvg Education Mortgage Personal_Loan Securities_Account CD_Account Online CreditCard
0 1 25 1 49 91107 4 1.6 1 0 0 1 0 0 0
1 2 45 19 34 90089 3 1.5 1 0 0 1 0 0 0
2 3 39 15 11 94720 1 1.0 1 0 0 0 0 0 0
3 4 35 9 100 94112 1 2.7 2 0 0 0 0 0 0
4 5 35 8 45 91330 4 1.0 2 0 0 0 0 0 1
In [18]:
#Viewing bottom observations from the dataset
df.tail()
Out[18]:
ID Age Experience Income ZIPCode Family CCAvg Education Mortgage Personal_Loan Securities_Account CD_Account Online CreditCard
4995 4996 29 3 40 92697 1 1.9 3 0 0 0 0 1 0
4996 4997 30 4 15 92037 4 0.4 1 85 0 0 0 1 0
4997 4998 63 39 24 93023 2 0.3 3 0 0 0 0 0 0
4998 4999 65 40 49 90034 3 0.5 2 0 0 0 0 1 0
4999 5000 28 4 83 92612 3 0.8 1 0 0 0 0 1 1

Observations:

  • Age,experience and income indicate a wide range of numbers
  • ID is set as the index
  • Found mortgage has many 0's and 3 digit number indicating a skwed data
  • Zipcode can be restructured into county for easy analysis
  • Family and Education are under 5
  • Personal_loan,securities_account,CD_Account,Online and creditc ard are all binary values
In [19]:
#Viewing the overall information of the dataset
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 14 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   ID                  5000 non-null   int64  
 1   Age                 5000 non-null   int64  
 2   Experience          5000 non-null   int64  
 3   Income              5000 non-null   int64  
 4   ZIPCode             5000 non-null   int64  
 5   Family              5000 non-null   int64  
 6   CCAvg               5000 non-null   float64
 7   Education           5000 non-null   int64  
 8   Mortgage            5000 non-null   int64  
 9   Personal_Loan       5000 non-null   int64  
 10  Securities_Account  5000 non-null   int64  
 11  CD_Account          5000 non-null   int64  
 12  Online              5000 non-null   int64  
 13  CreditCard          5000 non-null   int64  
dtypes: float64(1), int64(13)
memory usage: 547.0 KB

Observation¶

  1. There is no missing values listed
  2. Memory used is 547.0 KB
  3. There are 12 float integers and 1 float type columns.
In [20]:
df.isnull().sum()
Out[20]:
ID                    0
Age                   0
Experience            0
Income                0
ZIPCode               0
Family                0
CCAvg                 0
Education             0
Mortgage              0
Personal_Loan         0
Securities_Account    0
CD_Account            0
Online                0
CreditCard            0
dtype: int64

Observation¶

  1. There is no null present
In [21]:
df.nunique()
Out[21]:
ID                    5000
Age                     45
Experience              47
Income                 162
ZIPCode                467
Family                   4
CCAvg                  108
Education                3
Mortgage               347
Personal_Loan            2
Securities_Account       2
CD_Account               2
Online                   2
CreditCard               2
dtype: int64

Observation

  1. It give us an idea of number of unique values in each column.
In [22]:
#Checking the shape of the dataset
df.shape
Out[22]:
(5000, 14)

There are 5000 rows and 13 columns in this dataframe.

In [23]:
#checking the duplicates
df.duplicated().sum()
Out[23]:
0

There is no duplicates in the data.

Initial Data processing¶

In [24]:
##library required to convert zipcode to county
!pip install uszipcode
import sys
print(sys.version)
Requirement already satisfied: uszipcode in c:\users\paluv\appdata\local\packages\pythonsoftwarefoundation.python.3.11_qbz5n2kfra8p0\localcache\local-packages\python311\site-packages (1.0.1)
Requirement already satisfied: attrs in c:\users\paluv\appdata\local\packages\pythonsoftwarefoundation.python.3.11_qbz5n2kfra8p0\localcache\local-packages\python311\site-packages (from uszipcode) (24.2.0)
Requirement already satisfied: requests in c:\users\paluv\appdata\local\packages\pythonsoftwarefoundation.python.3.11_qbz5n2kfra8p0\localcache\local-packages\python311\site-packages (from uszipcode) (2.32.3)
Requirement already satisfied: pathlib-mate in c:\users\paluv\appdata\local\packages\pythonsoftwarefoundation.python.3.11_qbz5n2kfra8p0\localcache\local-packages\python311\site-packages (from uszipcode) (1.3.2)
Requirement already satisfied: atomicwrites in c:\users\paluv\appdata\local\packages\pythonsoftwarefoundation.python.3.11_qbz5n2kfra8p0\localcache\local-packages\python311\site-packages (from uszipcode) (1.4.1)
Requirement already satisfied: fuzzywuzzy in c:\users\paluv\appdata\local\packages\pythonsoftwarefoundation.python.3.11_qbz5n2kfra8p0\localcache\local-packages\python311\site-packages (from uszipcode) (0.18.0)
Requirement already satisfied: haversine>=2.5.0 in c:\users\paluv\appdata\local\packages\pythonsoftwarefoundation.python.3.11_qbz5n2kfra8p0\localcache\local-packages\python311\site-packages (from uszipcode) (2.8.1)
Requirement already satisfied: SQLAlchemy>=1.4.0 in c:\users\paluv\appdata\local\packages\pythonsoftwarefoundation.python.3.11_qbz5n2kfra8p0\localcache\local-packages\python311\site-packages (from uszipcode) (2.0.35)
Requirement already satisfied: sqlalchemy-mate>=1.4.28.3 in c:\users\paluv\appdata\local\packages\pythonsoftwarefoundation.python.3.11_qbz5n2kfra8p0\localcache\local-packages\python311\site-packages (from uszipcode) (2.0.0.0)
Requirement already satisfied: typing-extensions>=4.6.0 in c:\users\paluv\appdata\local\packages\pythonsoftwarefoundation.python.3.11_qbz5n2kfra8p0\localcache\local-packages\python311\site-packages (from SQLAlchemy>=1.4.0->uszipcode) (4.12.2)
Requirement already satisfied: greenlet!=0.4.17 in c:\users\paluv\appdata\local\packages\pythonsoftwarefoundation.python.3.11_qbz5n2kfra8p0\localcache\local-packages\python311\site-packages (from SQLAlchemy>=1.4.0->uszipcode) (3.1.1)
Requirement already satisfied: prettytable<4.0.0,>=3.0.0 in c:\users\paluv\appdata\local\packages\pythonsoftwarefoundation.python.3.11_qbz5n2kfra8p0\localcache\local-packages\python311\site-packages (from sqlalchemy-mate>=1.4.28.3->uszipcode) (3.11.0)
Requirement already satisfied: charset-normalizer<4,>=2 in c:\users\paluv\appdata\local\packages\pythonsoftwarefoundation.python.3.11_qbz5n2kfra8p0\localcache\local-packages\python311\site-packages (from requests->uszipcode) (3.3.2)
Requirement already satisfied: idna<4,>=2.5 in c:\users\paluv\appdata\local\packages\pythonsoftwarefoundation.python.3.11_qbz5n2kfra8p0\localcache\local-packages\python311\site-packages (from requests->uszipcode) (3.10)
Requirement already satisfied: urllib3<3,>=1.21.1 in c:\users\paluv\appdata\local\packages\pythonsoftwarefoundation.python.3.11_qbz5n2kfra8p0\localcache\local-packages\python311\site-packages (from requests->uszipcode) (2.2.3)
Requirement already satisfied: certifi>=2017.4.17 in c:\users\paluv\appdata\local\packages\pythonsoftwarefoundation.python.3.11_qbz5n2kfra8p0\localcache\local-packages\python311\site-packages (from requests->uszipcode) (2024.8.30)
Requirement already satisfied: wcwidth in c:\users\paluv\appdata\local\packages\pythonsoftwarefoundation.python.3.11_qbz5n2kfra8p0\localcache\local-packages\python311\site-packages (from prettytable<4.0.0,>=3.0.0->sqlalchemy-mate>=1.4.28.3->uszipcode) (0.2.13)
3.11.9 (tags/v3.11.9:de54cf5, Apr  2 2024, 10:12:12) [MSC v.1938 64 bit (AMD64)]
[notice] A new release of pip is available: 24.0 -> 24.2
[notice] To update, run: C:\Users\paluv\AppData\Local\Microsoft\WindowsApps\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\python.exe -m pip install --upgrade pip
In [25]:
#user defined function
def zipcode_to_county(zipcode):
    from uszipcode import SearchEngine
    search = SearchEngine(simple_zipcode=True)
    result = search.by_zipcode(zipcode)
    return result.to_dict()['county']

    #library required to convert zipcode to county
!pip install usZipcode==1.0.1 sqlalchemy_mate==2.0.0.0
from uszipcode import SearchEngine
import pandas as pd


def Zipcode_to_county(Zipc):
    search = SearchEngine()
    Zipcode = search.by_zipcode(Zipc)  # Converting zipcode to zipcode object
    if Zipcode:  # Checking if Zipcode object exists
        return Zipcode.county
    else:
        return None  # Return None if zipcode is not found


#applying the user defined function and creating a new column 'County'
# Check if 'zip_code' column exists, if not, try variations or check your data
if 'zip_code' in df.columns:
    df['county'] = df['zip_code'].apply(Zipcode_to_county)
else:

    possible_column_names = ['zipcode', 'zip_code', 'ZIPCODE']
    for col_name in possible_column_names:
        if col_name in df.columns:
            print(f"Found column '{col_name}', using it instead of 'zip_code'")
            df['county'] = df[col_name].apply(Zipcode_to_county)
            break
    else:
        print("No suitable column found in the DataFrame.")
Requirement already satisfied: usZipcode==1.0.1 in c:\users\paluv\appdata\local\packages\pythonsoftwarefoundation.python.3.11_qbz5n2kfra8p0\localcache\local-packages\python311\site-packages (1.0.1)
Requirement already satisfied: sqlalchemy_mate==2.0.0.0 in c:\users\paluv\appdata\local\packages\pythonsoftwarefoundation.python.3.11_qbz5n2kfra8p0\localcache\local-packages\python311\site-packages (2.0.0.0)
Requirement already satisfied: attrs in c:\users\paluv\appdata\local\packages\pythonsoftwarefoundation.python.3.11_qbz5n2kfra8p0\localcache\local-packages\python311\site-packages (from usZipcode==1.0.1) (24.2.0)
Requirement already satisfied: requests in c:\users\paluv\appdata\local\packages\pythonsoftwarefoundation.python.3.11_qbz5n2kfra8p0\localcache\local-packages\python311\site-packages (from usZipcode==1.0.1) (2.32.3)
Requirement already satisfied: pathlib-mate in c:\users\paluv\appdata\local\packages\pythonsoftwarefoundation.python.3.11_qbz5n2kfra8p0\localcache\local-packages\python311\site-packages (from usZipcode==1.0.1) (1.3.2)
Requirement already satisfied: atomicwrites in c:\users\paluv\appdata\local\packages\pythonsoftwarefoundation.python.3.11_qbz5n2kfra8p0\localcache\local-packages\python311\site-packages (from usZipcode==1.0.1) (1.4.1)
Requirement already satisfied: fuzzywuzzy in c:\users\paluv\appdata\local\packages\pythonsoftwarefoundation.python.3.11_qbz5n2kfra8p0\localcache\local-packages\python311\site-packages (from usZipcode==1.0.1) (0.18.0)
Requirement already satisfied: haversine>=2.5.0 in c:\users\paluv\appdata\local\packages\pythonsoftwarefoundation.python.3.11_qbz5n2kfra8p0\localcache\local-packages\python311\site-packages (from usZipcode==1.0.1) (2.8.1)
Requirement already satisfied: SQLAlchemy>=1.4.0 in c:\users\paluv\appdata\local\packages\pythonsoftwarefoundation.python.3.11_qbz5n2kfra8p0\localcache\local-packages\python311\site-packages (from usZipcode==1.0.1) (2.0.35)
Requirement already satisfied: prettytable<4.0.0,>=3.0.0 in c:\users\paluv\appdata\local\packages\pythonsoftwarefoundation.python.3.11_qbz5n2kfra8p0\localcache\local-packages\python311\site-packages (from sqlalchemy_mate==2.0.0.0) (3.11.0)
Requirement already satisfied: wcwidth in c:\users\paluv\appdata\local\packages\pythonsoftwarefoundation.python.3.11_qbz5n2kfra8p0\localcache\local-packages\python311\site-packages (from prettytable<4.0.0,>=3.0.0->sqlalchemy_mate==2.0.0.0) (0.2.13)
Requirement already satisfied: typing-extensions>=4.6.0 in c:\users\paluv\appdata\local\packages\pythonsoftwarefoundation.python.3.11_qbz5n2kfra8p0\localcache\local-packages\python311\site-packages (from SQLAlchemy>=1.4.0->usZipcode==1.0.1) (4.12.2)
Requirement already satisfied: greenlet!=0.4.17 in c:\users\paluv\appdata\local\packages\pythonsoftwarefoundation.python.3.11_qbz5n2kfra8p0\localcache\local-packages\python311\site-packages (from SQLAlchemy>=1.4.0->usZipcode==1.0.1) (3.1.1)
Requirement already satisfied: charset-normalizer<4,>=2 in c:\users\paluv\appdata\local\packages\pythonsoftwarefoundation.python.3.11_qbz5n2kfra8p0\localcache\local-packages\python311\site-packages (from requests->usZipcode==1.0.1) (3.3.2)
Requirement already satisfied: idna<4,>=2.5 in c:\users\paluv\appdata\local\packages\pythonsoftwarefoundation.python.3.11_qbz5n2kfra8p0\localcache\local-packages\python311\site-packages (from requests->usZipcode==1.0.1) (3.10)
Requirement already satisfied: urllib3<3,>=1.21.1 in c:\users\paluv\appdata\local\packages\pythonsoftwarefoundation.python.3.11_qbz5n2kfra8p0\localcache\local-packages\python311\site-packages (from requests->usZipcode==1.0.1) (2.2.3)
Requirement already satisfied: certifi>=2017.4.17 in c:\users\paluv\appdata\local\packages\pythonsoftwarefoundation.python.3.11_qbz5n2kfra8p0\localcache\local-packages\python311\site-packages (from requests->usZipcode==1.0.1) (2024.8.30)
[notice] A new release of pip is available: 24.0 -> 24.2
[notice] To update, run: C:\Users\paluv\AppData\Local\Microsoft\WindowsApps\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\python.exe -m pip install --upgrade pip
No suitable column found in the DataFrame.
In [26]:
#Applying the user defined function and creating a new column 'County'
!pip install uszipcode
df['County'] = df['ZIPCode'].apply
Requirement already satisfied: uszipcode in c:\users\paluv\appdata\local\packages\pythonsoftwarefoundation.python.3.11_qbz5n2kfra8p0\localcache\local-packages\python311\site-packages (1.0.1)
Requirement already satisfied: attrs in c:\users\paluv\appdata\local\packages\pythonsoftwarefoundation.python.3.11_qbz5n2kfra8p0\localcache\local-packages\python311\site-packages (from uszipcode) (24.2.0)
Requirement already satisfied: requests in c:\users\paluv\appdata\local\packages\pythonsoftwarefoundation.python.3.11_qbz5n2kfra8p0\localcache\local-packages\python311\site-packages (from uszipcode) (2.32.3)
Requirement already satisfied: pathlib-mate in c:\users\paluv\appdata\local\packages\pythonsoftwarefoundation.python.3.11_qbz5n2kfra8p0\localcache\local-packages\python311\site-packages (from uszipcode) (1.3.2)
Requirement already satisfied: atomicwrites in c:\users\paluv\appdata\local\packages\pythonsoftwarefoundation.python.3.11_qbz5n2kfra8p0\localcache\local-packages\python311\site-packages (from uszipcode) (1.4.1)
Requirement already satisfied: fuzzywuzzy in c:\users\paluv\appdata\local\packages\pythonsoftwarefoundation.python.3.11_qbz5n2kfra8p0\localcache\local-packages\python311\site-packages (from uszipcode) (0.18.0)
Requirement already satisfied: haversine>=2.5.0 in c:\users\paluv\appdata\local\packages\pythonsoftwarefoundation.python.3.11_qbz5n2kfra8p0\localcache\local-packages\python311\site-packages (from uszipcode) (2.8.1)
Requirement already satisfied: SQLAlchemy>=1.4.0 in c:\users\paluv\appdata\local\packages\pythonsoftwarefoundation.python.3.11_qbz5n2kfra8p0\localcache\local-packages\python311\site-packages (from uszipcode) (2.0.35)
Requirement already satisfied: sqlalchemy-mate>=1.4.28.3 in c:\users\paluv\appdata\local\packages\pythonsoftwarefoundation.python.3.11_qbz5n2kfra8p0\localcache\local-packages\python311\site-packages (from uszipcode) (2.0.0.0)
Requirement already satisfied: typing-extensions>=4.6.0 in c:\users\paluv\appdata\local\packages\pythonsoftwarefoundation.python.3.11_qbz5n2kfra8p0\localcache\local-packages\python311\site-packages (from SQLAlchemy>=1.4.0->uszipcode) (4.12.2)
Requirement already satisfied: greenlet!=0.4.17 in c:\users\paluv\appdata\local\packages\pythonsoftwarefoundation.python.3.11_qbz5n2kfra8p0\localcache\local-packages\python311\site-packages (from SQLAlchemy>=1.4.0->uszipcode) (3.1.1)
Requirement already satisfied: prettytable<4.0.0,>=3.0.0 in c:\users\paluv\appdata\local\packages\pythonsoftwarefoundation.python.3.11_qbz5n2kfra8p0\localcache\local-packages\python311\site-packages (from sqlalchemy-mate>=1.4.28.3->uszipcode) (3.11.0)
Requirement already satisfied: charset-normalizer<4,>=2 in c:\users\paluv\appdata\local\packages\pythonsoftwarefoundation.python.3.11_qbz5n2kfra8p0\localcache\local-packages\python311\site-packages (from requests->uszipcode) (3.3.2)
Requirement already satisfied: idna<4,>=2.5 in c:\users\paluv\appdata\local\packages\pythonsoftwarefoundation.python.3.11_qbz5n2kfra8p0\localcache\local-packages\python311\site-packages (from requests->uszipcode) (3.10)
Requirement already satisfied: urllib3<3,>=1.21.1 in c:\users\paluv\appdata\local\packages\pythonsoftwarefoundation.python.3.11_qbz5n2kfra8p0\localcache\local-packages\python311\site-packages (from requests->uszipcode) (2.2.3)
Requirement already satisfied: certifi>=2017.4.17 in c:\users\paluv\appdata\local\packages\pythonsoftwarefoundation.python.3.11_qbz5n2kfra8p0\localcache\local-packages\python311\site-packages (from requests->uszipcode) (2024.8.30)
Requirement already satisfied: wcwidth in c:\users\paluv\appdata\local\packages\pythonsoftwarefoundation.python.3.11_qbz5n2kfra8p0\localcache\local-packages\python311\site-packages (from prettytable<4.0.0,>=3.0.0->sqlalchemy-mate>=1.4.28.3->uszipcode) (0.2.13)
[notice] A new release of pip is available: 24.0 -> 24.2
[notice] To update, run: C:\Users\paluv\AppData\Local\Microsoft\WindowsApps\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\python.exe -m pip install --upgrade pip
In [27]:
search = SearchEngine()
df['County'] = df['ZIPCode'].apply(search.by_zipcode)
df['County'] = df['County'].apply(lambda x: x.to_dict()['county'] if x else None)
df.head(10)
Out[27]:
ID Age Experience Income ZIPCode Family CCAvg Education Mortgage Personal_Loan Securities_Account CD_Account Online CreditCard County
0 1 25 1 49 91107 4 1.6 1 0 0 1 0 0 0 Los Angeles County
1 2 45 19 34 90089 3 1.5 1 0 0 1 0 0 0 Los Angeles County
2 3 39 15 11 94720 1 1.0 1 0 0 0 0 0 0 Alameda County
3 4 35 9 100 94112 1 2.7 2 0 0 0 0 0 0 San Francisco County
4 5 35 8 45 91330 4 1.0 2 0 0 0 0 0 1 Los Angeles County
5 6 37 13 29 92121 4 0.4 2 155 0 0 0 1 0 San Diego County
6 7 53 27 72 91711 2 1.5 2 0 0 0 0 1 0 Los Angeles County
7 8 50 24 22 93943 1 0.3 3 0 0 0 0 0 1 Monterey County
8 9 35 10 81 90089 3 0.6 2 104 0 0 0 1 0 Los Angeles County
9 10 34 9 180 93023 1 8.9 3 0 1 0 0 0 0 Ventura County
In [28]:
df.nunique()
Out[28]:
ID                    5000
Age                     45
Experience              47
Income                 162
ZIPCode                467
Family                   4
CCAvg                  108
Education                3
Mortgage               347
Personal_Loan            2
Securities_Account       2
CD_Account               2
Online                   2
CreditCard               2
County                  38
dtype: int64

Observation¶

467 unique values in ZIPCode decreased to 38 unique values in 'County'

In [29]:
#Making a list of columns
df.columns
Out[29]:
Index(['ID', 'Age', 'Experience', 'Income', 'ZIPCode', 'Family', 'CCAvg',
       'Education', 'Mortgage', 'Personal_Loan', 'Securities_Account',
       'CD_Account', 'Online', 'CreditCard', 'County'],
      dtype='object')
In [30]:
df.describe(include="all")
Out[30]:
ID Age Experience Income ZIPCode Family CCAvg Education Mortgage Personal_Loan Securities_Account CD_Account Online CreditCard County
count 5000.000000 5000.000000 5000.000000 5000.000000 5000.000000 5000.000000 5000.000000 5000.000000 5000.000000 5000.000000 5000.000000 5000.00000 5000.000000 5000.000000 4966
unique NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 38
top NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN Los Angeles County
freq NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 1095
mean 2500.500000 45.338400 20.104600 73.774200 93169.257000 2.396400 1.937938 1.881000 56.498800 0.096000 0.104400 0.06040 0.596800 0.294000 NaN
std 1443.520003 11.463166 11.467954 46.033729 1759.455086 1.147663 1.747659 0.839869 101.713802 0.294621 0.305809 0.23825 0.490589 0.455637 NaN
min 1.000000 23.000000 -3.000000 8.000000 90005.000000 1.000000 0.000000 1.000000 0.000000 0.000000 0.000000 0.00000 0.000000 0.000000 NaN
25% 1250.750000 35.000000 10.000000 39.000000 91911.000000 1.000000 0.700000 1.000000 0.000000 0.000000 0.000000 0.00000 0.000000 0.000000 NaN
50% 2500.500000 45.000000 20.000000 64.000000 93437.000000 2.000000 1.500000 2.000000 0.000000 0.000000 0.000000 0.00000 1.000000 0.000000 NaN
75% 3750.250000 55.000000 30.000000 98.000000 94608.000000 3.000000 2.500000 3.000000 101.000000 0.000000 0.000000 0.00000 1.000000 1.000000 NaN
max 5000.000000 67.000000 43.000000 224.000000 96651.000000 4.000000 10.000000 3.000000 635.000000 1.000000 1.000000 1.00000 1.000000 1.000000 NaN
In [31]:
# Printing top 5 unique values in each column

for i in df.columns:
    print("\nUnique values in", i, "are :")
    print(df[i].value_counts().head())
    print("\n")
    print("-" * 40)
Unique values in ID are :
ID
1       1
3331    1
3338    1
3337    1
3336    1
Name: count, dtype: int64


----------------------------------------

Unique values in Age are :
Age
35    151
43    149
52    145
54    143
58    143
Name: count, dtype: int64


----------------------------------------

Unique values in Experience are :
Experience
32    154
20    148
9     147
5     146
23    144
Name: count, dtype: int64


----------------------------------------

Unique values in Income are :
Income
44    85
38    84
81    83
41    82
39    81
Name: count, dtype: int64


----------------------------------------

Unique values in ZIPCode are :
ZIPCode
94720    169
94305    127
95616    116
90095     71
93106     57
Name: count, dtype: int64


----------------------------------------

Unique values in Family are :
Family
1    1472
2    1296
4    1222
3    1010
Name: count, dtype: int64


----------------------------------------

Unique values in CCAvg are :
CCAvg
0.3    241
1.0    231
0.2    204
2.0    188
0.8    187
Name: count, dtype: int64


----------------------------------------

Unique values in Education are :
Education
1    2096
3    1501
2    1403
Name: count, dtype: int64


----------------------------------------

Unique values in Mortgage are :
Mortgage
0      3462
98       17
119      16
89       16
91       16
Name: count, dtype: int64


----------------------------------------

Unique values in Personal_Loan are :
Personal_Loan
0    4520
1     480
Name: count, dtype: int64


----------------------------------------

Unique values in Securities_Account are :
Securities_Account
0    4478
1     522
Name: count, dtype: int64


----------------------------------------

Unique values in CD_Account are :
CD_Account
0    4698
1     302
Name: count, dtype: int64


----------------------------------------

Unique values in Online are :
Online
1    2984
0    2016
Name: count, dtype: int64


----------------------------------------

Unique values in CreditCard are :
CreditCard
0    3530
1    1470
Name: count, dtype: int64


----------------------------------------

Unique values in County are :
County
Los Angeles County    1095
San Diego County       568
Santa Clara County     563
Alameda County         500
Orange County          339
Name: count, dtype: int64


----------------------------------------

Observation¶

  • 0's out-number 1 in Mortgage,Personal_Loan, Securities_Account, CD_Account and CreditCard coulmns.
  • Los Angeles County has the highest number of observations of 1095. There is handful of counties with under 50 observations.
In [32]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 15 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   ID                  5000 non-null   int64  
 1   Age                 5000 non-null   int64  
 2   Experience          5000 non-null   int64  
 3   Income              5000 non-null   int64  
 4   ZIPCode             5000 non-null   int64  
 5   Family              5000 non-null   int64  
 6   CCAvg               5000 non-null   float64
 7   Education           5000 non-null   int64  
 8   Mortgage            5000 non-null   int64  
 9   Personal_Loan       5000 non-null   int64  
 10  Securities_Account  5000 non-null   int64  
 11  CD_Account          5000 non-null   int64  
 12  Online              5000 non-null   int64  
 13  CreditCard          5000 non-null   int64  
 14  County              4966 non-null   object 
dtypes: float64(1), int64(13), object(1)
memory usage: 586.1+ KB

Datatypes of CCAvg is float and county is object.Rest of them are integers.

In [33]:
#Printing bottom 5 unique values in each column

for i in df.columns:
    print("\nUnique values in", i, "are :")
    print(df[i].value_counts().tail())
    print("\n")
    print("-" * 40)
Unique values in ID are :
ID
1667    1
1666    1
1665    1
1664    1
5000    1
Name: count, dtype: int64


----------------------------------------

Unique values in Age are :
Age
25    53
24    28
66    24
67    12
23    12
Name: count, dtype: int64


----------------------------------------

Unique values in Experience are :
Experience
-1     33
-2     15
 42     8
-3      4
 43     3
Name: count, dtype: int64


----------------------------------------

Unique values in Income are :
Income
202    2
203    2
189    2
224    1
218    1
Name: count, dtype: int64


----------------------------------------

Unique values in ZIPCode are :
ZIPCode
96145    1
94087    1
91024    1
93077    1
94598    1
Name: count, dtype: int64


----------------------------------------

Unique values in Family are :
Family
1    1472
2    1296
4    1222
3    1010
Name: count, dtype: int64


----------------------------------------

Unique values in CCAvg are :
CCAvg
3.25    1
3.67    1
4.67    1
8.90    1
2.75    1
Name: count, dtype: int64


----------------------------------------

Unique values in Education are :
Education
1    2096
3    1501
2    1403
Name: count, dtype: int64


----------------------------------------

Unique values in Mortgage are :
Mortgage
547    1
458    1
505    1
361    1
541    1
Name: count, dtype: int64


----------------------------------------

Unique values in Personal_Loan are :
Personal_Loan
0    4520
1     480
Name: count, dtype: int64


----------------------------------------

Unique values in Securities_Account are :
Securities_Account
0    4478
1     522
Name: count, dtype: int64


----------------------------------------

Unique values in CD_Account are :
CD_Account
0    4698
1     302
Name: count, dtype: int64


----------------------------------------

Unique values in Online are :
Online
1    2984
0    2016
Name: count, dtype: int64


----------------------------------------

Unique values in CreditCard are :
CreditCard
0    3530
1    1470
Name: count, dtype: int64


----------------------------------------

Unique values in County are :
County
Merced County      4
Trinity County     4
Lake County        4
Imperial County    3
Napa County        3
Name: count, dtype: int64


----------------------------------------

Observation

  • There are negative values in 'Experience' that needs to be fixed
  • There are counties under 5 observations as opposed to Los Angeles County with 1095 observations.
In [34]:
df[df['Experience'] < 0].value_counts().sum()
Out[34]:
51
  • Overall 51 rows with negative experience
In [35]:
df[df['Experience'] < 0].sort_values(by='Experience', ascending=True)
Out[35]:
ID Age Experience Income ZIPCode Family CCAvg Education Mortgage Personal_Loan Securities_Account CD_Account Online CreditCard County
4514 4515 24 -3 41 91768 4 1.00 3 0 0 0 0 1 0 Los Angeles County
2618 2619 23 -3 55 92704 3 2.40 2 145 0 0 0 1 0 Orange County
4285 4286 23 -3 149 93555 2 7.20 1 0 0 0 0 1 0 Kern County
3626 3627 24 -3 28 90089 4 1.00 3 0 0 0 0 0 0 Los Angeles County
2717 2718 23 -2 45 95422 4 0.60 2 0 0 0 0 1 1 Lake County
2876 2877 24 -2 80 91107 2 1.60 3 238 0 0 0 0 0 Los Angeles County
3130 3131 23 -2 82 92152 2 1.80 2 0 0 1 0 0 1 San Diego County
2466 2467 24 -2 80 94105 2 1.60 3 0 0 0 0 1 0 San Francisco County
3796 3797 24 -2 50 94920 3 2.40 2 0 0 1 0 0 0 Marin County
2962 2963 23 -2 81 91711 2 1.80 2 0 0 0 0 0 0 Los Angeles County
793 794 24 -2 150 94720 2 2.00 1 0 0 0 0 1 0 Alameda County
3887 3888 24 -2 118 92634 2 7.20 1 0 0 1 0 1 0 None
889 890 24 -2 82 91103 2 1.60 3 0 0 0 0 1 1 Los Angeles County
597 598 24 -2 125 92835 2 7.20 1 0 0 1 0 0 1 Orange County
315 316 24 -2 51 90630 3 0.30 3 0 0 0 0 1 0 Orange County
451 452 28 -2 48 94132 2 1.75 3 89 0 0 0 1 0 San Francisco County
4116 4117 24 -2 135 90065 2 7.20 1 0 0 0 0 1 0 Los Angeles County
4481 4482 25 -2 35 95045 4 1.00 3 0 0 0 0 1 0 San Benito County
4411 4412 23 -2 75 90291 2 1.80 2 0 0 0 0 1 1 Los Angeles County
4015 4016 25 -1 139 93106 2 2.00 1 0 0 0 0 0 1 Santa Barbara County
3157 3158 23 -1 13 94720 4 1.00 1 84 0 0 0 1 0 Alameda County
3279 3280 26 -1 44 94901 1 2.00 2 0 0 0 0 0 0 Marin County
3284 3285 25 -1 101 95819 4 2.10 3 0 0 0 0 0 1 Sacramento County
3292 3293 25 -1 13 95616 4 0.40 1 0 0 1 0 0 0 Yolo County
3394 3395 25 -1 113 90089 4 2.10 3 0 0 0 0 1 0 Los Angeles County
3425 3426 23 -1 12 91605 4 1.00 1 90 0 0 0 1 0 Los Angeles County
4088 4089 29 -1 71 94801 2 1.75 3 0 0 0 0 0 0 Contra Costa County
3824 3825 23 -1 12 95064 4 1.00 1 0 0 1 0 0 1 Santa Cruz County
3946 3947 25 -1 40 93117 3 2.40 2 0 0 0 0 1 0 Santa Barbara County
3076 3077 29 -1 62 92672 2 1.75 3 0 0 0 0 0 1 Orange County
89 90 25 -1 113 94303 4 2.30 3 0 0 0 0 0 1 San Mateo County
2848 2849 24 -1 78 94720 2 1.80 2 0 0 0 0 0 0 Alameda County
226 227 24 -1 39 94085 2 1.70 2 0 0 0 0 0 0 Santa Clara County
524 525 24 -1 75 93014 4 0.20 1 0 0 0 0 1 0 Santa Barbara County
536 537 25 -1 43 92173 3 2.40 2 176 0 0 0 1 0 San Diego County
540 541 25 -1 109 94010 4 2.30 3 314 0 0 0 1 0 San Mateo County
576 577 25 -1 48 92870 3 0.30 3 0 0 0 0 0 1 Orange County
583 584 24 -1 38 95045 2 1.70 2 0 0 0 0 1 0 San Benito County
649 650 25 -1 82 92677 4 2.10 3 0 0 0 0 1 0 Orange County
670 671 23 -1 61 92374 4 2.60 1 239 0 0 0 1 0 San Bernardino County
2980 2981 25 -1 53 94305 3 2.40 2 0 0 0 0 0 0 Santa Clara County
686 687 24 -1 38 92612 4 0.60 2 0 0 0 0 1 0 Orange County
1173 1174 24 -1 35 94305 2 1.70 2 0 0 0 0 0 0 Santa Clara County
1428 1429 25 -1 21 94583 4 0.40 1 90 0 0 0 1 0 Contra Costa County
1522 1523 25 -1 101 94720 4 2.30 3 256 0 0 0 0 1 Alameda County
1905 1906 25 -1 112 92507 2 2.00 1 241 0 0 0 1 0 Riverside County
2102 2103 25 -1 81 92647 2 1.60 3 0 0 0 0 1 1 Orange County
2430 2431 23 -1 73 92120 4 2.60 1 0 0 0 0 1 0 San Diego County
2545 2546 25 -1 39 94720 3 2.40 2 0 0 0 0 1 0 Alameda County
4582 4583 25 -1 69 92691 3 0.30 3 0 0 0 0 1 0 Orange County
909 910 23 -1 149 91709 1 6.33 1 305 0 0 0 0 1 San Bernardino County
4957 4958 29 -1 50 95842 2 1.75 3 0 0 0 0 0 1 Sacramento County

Observation

  • There are 51 rows with negative values for Experience
  • Let's look at the corresponding age, education and income values for any pattern.
In [36]:
df[df['Experience']<0]["Age"].value_counts()
Out[36]:
Age
25    18
24    17
23    12
29     3
28     1
26     1
Name: count, dtype: int64
In [37]:
df[df["Experience"]<0]["Education"].value_counts()
Out[37]:
Education
3    20
2    16
1    16
Name: count, dtype: int64
In [38]:
df[df["Experience"]<0]["Income"].value_counts()
Out[38]:
Income
82     3
113    2
149    2
12     2
80     2
39     2
81     2
101    2
50     2
35     2
13     2
38     2
75     2
48     2
62     1
40     1
139    1
71     1
28     1
135    1
41     1
44     1
118    1
73     1
53     1
78     1
45     1
55     1
112    1
21     1
150    1
61     1
125    1
109    1
43     1
51     1
69     1
Name: count, dtype: int64
  • Considering the corresponding Age,education,and incoe values, there does not seem to be any pattern and hence the negative value in expeience could most probably be a data entry error. Therefore will convert these values to absolute values instead of treating them as missing values.
In [39]:
#user defined function to convert negative values to absolute values

def convert_to_absolute(value):
    if value < 0:
        return abs(value)
    else:
        return value
In [40]:
df['Experience'] = df['Experience'].apply(convert_to_absolute)
In [41]:
df["Experience"].unique()
Out[41]:
array([ 1, 19, 15,  9,  8, 13, 27, 24, 10, 39,  5, 23, 32, 41, 30, 14, 18,
       21, 28, 31, 11, 16, 20, 35,  6, 25,  7, 12, 26, 37, 17,  2, 36, 29,
        3, 22, 34,  0, 38, 40, 33,  4, 42, 43], dtype=int64)

There is no more negative values in Experience.

In [42]:
#Checking the statistical summary of the dataset
df.describe().T
Out[42]:
count mean std min 25% 50% 75% max
ID 5000.0 2500.500000 1443.520003 1.0 1250.75 2500.5 3750.25 5000.0
Age 5000.0 45.338400 11.463166 23.0 35.00 45.0 55.00 67.0
Experience 5000.0 20.134600 11.415189 0.0 10.00 20.0 30.00 43.0
Income 5000.0 73.774200 46.033729 8.0 39.00 64.0 98.00 224.0
ZIPCode 5000.0 93169.257000 1759.455086 90005.0 91911.00 93437.0 94608.00 96651.0
Family 5000.0 2.396400 1.147663 1.0 1.00 2.0 3.00 4.0
CCAvg 5000.0 1.937938 1.747659 0.0 0.70 1.5 2.50 10.0
Education 5000.0 1.881000 0.839869 1.0 1.00 2.0 3.00 3.0
Mortgage 5000.0 56.498800 101.713802 0.0 0.00 0.0 101.00 635.0
Personal_Loan 5000.0 0.096000 0.294621 0.0 0.00 0.0 0.00 1.0
Securities_Account 5000.0 0.104400 0.305809 0.0 0.00 0.0 0.00 1.0
CD_Account 5000.0 0.060400 0.238250 0.0 0.00 0.0 0.00 1.0
Online 5000.0 0.596800 0.490589 0.0 0.00 1.0 1.00 1.0
CreditCard 5000.0 0.294000 0.455637 0.0 0.00 0.0 1.00 1.0
  • There are 5000 observations present in all
  • The minimum and maximum values indicate a widely spread in Age,Experience,Income and Mortgage *. Differences between mean and medium values indicate skewness in the data.

Exploratory Data Analysis.¶

  • EDA is an important part of any project involving data.
  • It is important to investigate and understand the data better before building a model with it.
  • A few questions have been mentioned below which will help you approach the analysis in the right manner and generate insights from the data.
  • A thorough analysis of the data, in addition to the questions mentioned below, should be done.

Questions:

  1. What is the distribution of mortgage attribute? Are there any noticeable patterns or outliers in the distribution?
  2. How many customers have credit cards?
  3. What are the attributes that have a strong correlation with the target attribute (personal loan)?
  4. How does a customer's interest in purchasing a loan vary with their age?
  5. How does a customer's interest in purchasing a loan vary with their education?

Univariate Analysis¶

In [43]:
#Defining a method to print the value counts in the plots

def val_count(plot, feature):
    for p in ax.patches:
        ax.annotate(  # annotate the count
            p.get_height(),
            (
                p.get_x() + p.get_width() / 2.0,  # calculating the value counts
                p.get_height(),
            ),
            ha="center",  # horizontal alignment of the text
            va="center",  # vertical alignment
            xytext=(0, 5),  # text position
            textcoords="offset points",  # offset in points from the xy value
        )
In [44]:
#Defining a method to print the percentage of data points in the plot

def perc_on_bar(plot, feature):
    """
    plot
    feature: categorical feature
    the function won't work if a column is passed in hue parameter
    """
    total = len(feature)  # length of the column
    for p in ax.patches:
        percentage = "{:.1f}%".format(
            100 * p.get_height() / total
        )  # percentage of each class of the category
        x = p.get_x() + p.get_width() / 2 - 0.05  # width of the plot
        y = p.get_y() + p.get_height()  # hieght of the plot
        ax.annotate(percentage, (x, y), size=12)  # annotate the percantage
    plt.show()  # show the plot
In [45]:
# Defining a method to plot histogram and boxplot combined in a single plot


def histogram_boxplot(feature, figsize=(15, 10), bins=None):
    """Boxplot and histogram combined
    feature: 1-d feature array
    figsize: size of fig (default (9,8))
    bins: number of bins (default None / auto)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  # Number of rows of the subplot grid= 2
        sharex=True,  # x-axis will be shared among all subplots
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )  # creating the 2 subplots
    sns.boxplot(
        feature, ax=ax_box2, showmeans=True, color="violet"
    )  # boxplot will be created and a star will indicate the mean value of the column
    sns.distplot(
        feature, kde=F, ax=ax_hist2, bins=bins, color="orange"
    ) if bins else sns.distplot(
        feature, kde=False, ax=ax_hist2, color="green"
    )  # For histogram
    ax_hist2.axvline(
        np.mean(feature), color="purple", linestyle="--"
    )  # Add mean to the histogram
    ax_hist2.axvline(
        np.median(feature), color="black", linestyle="-"
    )  # Add median to the histogram
In [46]:
#setting the color palette for the plots
sns.set_palette("husl")

1. Age¶

In [47]:
import matplotlib.pyplot as plt
import seaborn as sns
# plot Histogram and boxplot combined in a single
def histogram_boxplot(data, feature):
  fig, (ax_box, ax_hist) = plt.subplots(2, sharex=True, gridspec_kw={"height_ratios": (.25, .75)})
  sns.boxplot(data=data, x=feature, ax=ax_box)
  sns.histplot(data=data, x=feature, ax=ax_hist)
  ax_box.set(yticks=[])
  sns.despine(ax=ax_hist)
  sns.despine(ax=ax_box, left=True)
histogram_boxplot(df, "Age")
#Add mean to the histogram
plt.axvline(df["Age"].mean(), color="k", linestyle="--")
#Add median to the histogram
plt.axvline(df["Age"].median(), color="r", linestyle="-")
Out[47]:
<matplotlib.lines.Line2D at 0x1b2161cccd0>
No description has been provided for this image

**Observation

  1. The uniform distribution with five peak values are present in the histogram.
  2. Mean and Median are very close to each other
  3. The boxplot indicates no outliers.

2. Experience¶

In [48]:
histogram_boxplot(df, "Experience")
#Add mean to the histogram
plt.axvline(df["Experience"].mean(), color="k", linestyle="--")
#Add median to the histogram
plt.axvline(df["Experience"].median(), color="r", linestyle="-")
Out[48]:
<matplotlib.lines.Line2D at 0x1b216473010>
No description has been provided for this image
  • Mean and median are same
  • Boxplot show no outliers
  • The histrogram shows a uniform distribution with five peak values as like age

3. Income¶

In [49]:
histogram_boxplot(df, "Income")
#Add mean to the histogram
plt.axvline(df["Income"].mean(), color="k", linestyle="--")
#Add median to the histogram
plt.axvline(df["Income"].median(), color="r", linestyle="-")
Out[49]:
<matplotlib.lines.Line2D at 0x1b218804190>
No description has been provided for this image

The distributon in the boxplot is right skewed. The outliers spotted above the upper whisker in the boxplot. The mean and medium are same.

4. Family¶

In [50]:
plt.figure(figsize=(10, 6))
ax = sns.countplot(data=df, x="Family")
perc_on_bar(ax, df["Family"])
No description has been provided for this image
  1. Maximum number of customers are with a family size of 1.The next family sizes which have more percentage are 2 and 4
  2. The least family size in the graph is 3(20.2%)

5. Credit Card Average¶

In [51]:
import matplotlib.pyplot as plt
import seaborn as sns
def histogram_boxplot(data, feature):
  fig, (ax_box, ax_hist) = plt.subplots(2, sharex=True, gridspec_kw={"height_ratios": (.25, .75)})
  sns.boxplot(data=data, x=feature, ax=ax_box)
  sns.histplot(data=data, x=feature, ax=ax_hist)
  ax_box.set(yticks=[])
  sns.despine(ax=ax_hist)
  sns.despine(ax=ax_box, left=True)

# plot Histogram and boxplot combined in a single
histogram_boxplot(df, "CCAvg")
#Add mean to the histogram
plt.axvline(df["CCAvg"].mean(), color="k", linestyle="--")
#Add median to the histogram
plt.axvline(df["CCAvg"].median(), color="r", linestyle="-")
Out[51]:
<matplotlib.lines.Line2D at 0x1b2188db790>
No description has been provided for this image
  • Right skewed data
  • Outliers above the upper whisker

6. Education¶

In [52]:
plt.figure(figsize=(10, 6))
ax = sns.countplot(data=df, x="Education")
perc_on_bar(ax, df["Education"])
No description has been provided for this image
  • Highest percentage of customer are under graduates

  • About 28% of the customers are graduates

  • Above 305of them are advanced/professionals

  • The distribution suggests that 'Education' can be treated as category.

7.Mortgage¶

In [53]:
Mortgage == df["Mortgage"]
Mortgage.value_counts()
print(Mortgage.value_counts())
Mortgage.unique()

histogram_boxplot(df, "Mortgage")
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[53], line 1
----> 1 Mortgage == df["Mortgage"]
      2 Mortgage.value_counts()
      3 print(Mortgage.value_counts())

NameError: name 'Mortgage' is not defined
  1. Highly skewed towards '0'
  2. Maximum number of customers do not have mortgage
  3. Outliers spotted above the upper whisker
  4. Most do further analysis to determine if 'Mortgage'can also be treated as category.

8. Personal Loan¶

In [ ]:
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd # Import pandas for data manipulation


plt.figure(figsize=(10, 6))
ax = sns.countplot(data=df, x="Personal_Loan")

perc_on_bar(ax, df["Personal_Loan"])
No description has been provided for this image

1.It is found that around 90% of the customers did not accept the personal loan offered in the previous campaign. 2.There is only minimum amount of 9.6% of customers have accepted the personal loan.

9. Securities¶

In [ ]:
plt.figure(figsize=(10, 6))
ax = sns.countplot(data=df, x="Securities_Account")
perc_on_bar(ax, df["Securities_Account"])
No description has been provided for this image

Only 10.4% of the customers hold a securities account; The majority of te customers don't have this account.

10. Certificate of Deposit Account¶

In [ ]:
plt.figure(figsize=(10, 6))
ax = sns.countplot(data=df, x="CD_Account")
perc_on_bar(ax, df["CD_Account"])
No description has been provided for this image

Only 6% of te customers hold a CD account; 94% of the customer don't have CD account.

11. Online¶

In [ ]:
plt.figure(figsize=(10, 6))
ax = sns.countplot(data=df, x="Online")
perc_on_bar(ax, df["Online"])
No description has been provided for this image

59.7% of the customers utilise internet banking facilities while 40.3% of them don't utilize online banking.

12.Credit Card¶

In [ ]:
plt.figure(figsize=(10, 6))
ax = sns.countplot(data=df, x="CreditCard")
perc_on_bar(ax, df["CreditCard"])
No description has been provided for this image

29.4% of customers use a credit card issues by other banks while 70.6% of the customers don't have credit cards.

13.county¶

In [ ]:
plt.figure(figsize=(35, 20))
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd # Import pandas for data manipulation
ax = sns.countplot(df["County"])
#creating a countplot and assigning it to variable ax
val_count(ax, "County")
#Labelled on the plot
plt.xticks(rotation=90)
plt.show
Out[ ]:
matplotlib.pyplot.show
def show(*args, **kwargs)
/usr/local/lib/python3.10/dist-packages/matplotlib/pyplot.pyDisplay all open figures.

Parameters
----------
block : bool, optional
    Whether to wait for all figures to be closed before returning.

    If `True` block and run the GUI main loop until all figure windows
    are closed.

    If `False` ensure that all figure windows are displayed and return
    immediately.  In this case, you are responsible for ensuring
    that the event loop is running to have responsive figures.

    Defaults to True in non-interactive mode and to False in interactive
    mode (see `.pyplot.isinteractive`).

See Also
--------
ion : Enable interactive mode, which shows / updates the figure after
      every plotting command, so that calling ``show()`` is not necessary.
ioff : Disable interactive mode.
savefig : Save the figure to an image file instead of showing it on screen.

Notes
-----
**Saving figures to file and showing a window at the same time**

If you want an image file as well as a user interface window, use
`.pyplot.savefig` before `.pyplot.show`. At the end of (a blocking)
``show()`` the figure is closed and thus unregistered from pyplot. Calling
`.pyplot.savefig` afterwards would save a new and thus empty figure. This
limitation of command order does not apply if the show is non-blocking or
if you keep a reference to the figure and use `.Figure.savefig`.

**Auto-show in jupyter notebooks**

The jupyter backends (activated via ``%matplotlib inline``,
``%matplotlib notebook``, or ``%matplotlib widget``), call ``show()`` at
the end of every cell by default. Thus, you usually don't have to call it
explicitly there.
No description has been provided for this image

Los Angeles county has the highest number of observations of 1095. There is handful of counties with under 50 observations.

Bivariate Analysis¶

In [ ]:
## Function to plot stacked bar chart


def stacked_plot(x):
    # sns.set(palette="nipy_spectral")
    tab1 = pd.crosstab(x, data["Personal_Loan"], margins=True)
    print(tab1)
    print("-" * 120)
    tab = pd.crosstab(x, data["Personal_Loan"], normalize="index")
    tab.plot(kind="bar", stacked=True, figsize=(10, 5))
    plt.legend(loc="lower left", frameon=False)
    plt.legend(loc="upper left", bbox_to_anchor=(0, 1))
    plt.xticks(rotation=0)
    plt.show()
In [ ]:
# outlier detection using boxplot
import seaborn as sns
import matplotlib.pyplot as plt
#convert categorical data into numerical data
numerical_features = ['Age', 'Experience', 'Income', 'CCAvg', 'Mortgage']
plt.figure(figsize=(15, 10))
for i, feature in enumerate(numerical_features):
    # Check if the feature exists in the DataFrame before plotting
    if feature in df.columns:
        plt.subplot(3, 3, i + 1)
        sns.boxplot(x=df[feature])
        plt.title(f'Boxplot of {feature}')
        plt.xlabel(feature)
        plt.tight_layout()
        plt.show()
    else:

      print(f"Feature '{feature}' not found in the DataFrame.")
      continue
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
  1. Pesonal Loan VS Age
In [ ]:
plt.figure(figsize=(10, 6))
ax = sns.boxplot(data=df, x="Age", hue="Personal_Loan", showmeans=True)
No description has been provided for this image
  1. The IQR of the age of customers who accepted the personal loan and those rejected the personal loan are both same.
  2. Both have the same mean values
  3. No outliers spotted on both boxplots
  1. Personal Loan VS Experience
In [ ]:
plt.figure(figsize=(10, 6))
ax = sns.boxplot(data=df, x="Experience", hue="Personal_Loan", showmeans=True)
plt.show()
No description has been provided for this image
  1. The mean of both boxplots are almost same
  2. The IQR ofthe experince of customers is almost similar for those accepted and rejected the loan
  3. No outliers are spotted.
  1. Personal Loan VS Income
In [ ]:
plt.figure(figsize=(10, 6))
ax = sns.boxplot(data=df, x="Income", hue="Personal_Loan", showmeans=True)
plt.show()
No description has been provided for this image
  • Customers with higher average annual income seem to have accepted the personal loan
  • The minimum average incoem of customers who accepted the loan is around $60000.00
  • Outliers spotted above the upper whisker for the ones with no personal loan
  1. Personal Loan VS Family
In [ ]:
plt.figure(figsize=(10, 6))
ax = sns.boxplot(data=df, x="Family", hue="Personal_Loan", showmeans=True)
plt.show()

ax = sns.countplot(data=df, x="Family", hue="Personal_Loan")
perc_on_bar(ax, df["Family"])
df["Family"].value_counts()
df["Family"].unique()
No description has been provided for this image
No description has been provided for this image
Out[ ]:
array([4, 3, 1, 2])
  1. These two graphs indicates that family size does not make a significant impact in making the choice of a loan
  2. On boxplot, we can see that customers with a family of 3 out numbers the rest in accpeting the loan.
  1. Personal Loan VS Credit Card Average
In [ ]:
plt.figure(figsize=(10, 6))
ax = sns.boxplot(data=df, x="CCAvg", hue="Personal_Loan", showmeans=True)
plt.show()
No description has been provided for this image
  1. It is easy to understand from the boxplot that the customers who spent more on credit card per month on an average,seem to take personal loan compares to other customers.
  1. Personal Loan VS Education
In [ ]:
def stacked_plot(x):
    """
    Creates a stacked bar plot showing the distribution of 'Personal_Loan'
    across different categories of the given column 'x'.

    Args:
        x: The column to use for the x-axis of the plot.
    """
    # sns.set(palette="nipy_spectral")
    # Replace 'data' with 'df' to use the correct DataFrame
    tab1 = pd.crosstab(x, df["Personal_Loan"], margins=True)
    print(tab1)
    print("-" * 120)
    # ... (Rest of your stacked_plot function)

import matplotlib.pyplot as plt
import seaborn as sns
plt.figure(figsize=(10, 6))
ax = sns.boxplot(data=df, x="Education", hue="Personal_Loan", showmeans=True)
plt.show()
stacked_plot(df["Education"])
No description has been provided for this image
Personal_Loan     0    1   All
Education                     
1              2003   93  2096
2              1221  182  1403
3              1296  205  1501
All            4520  480  5000
------------------------------------------------------------------------------------------------------------------------
  1. Customer with graduate and advanced/profession levels of education have higher preference for personal loan than those customers who are Undergraduates.
  1. Personal Loan Vs Mortagage
In [ ]:
plt.figure(figsize=(10, 6))
ax = sns.boxplot(data=df, x="Mortgage", hue="Personal_Loan", showmeans=True)
plt.show()
No description has been provided for this image

We can see that Mortgage is skewed towards 0 , no significant lower shikers seen on the boxplot. It is understandable that the customers with mortgage seem to have accepted more personal loan on an average.

  1. Personal Loan VS Securities Account
In [ ]:
stacked_plot(df["Securities_Account"])
sns.set(palette="nipy_spectral")
ax = sns.countplot(data=df, x="Securities_Account", hue="Personal_Loan")
perc_on_bar(ax, df["Securities_Account"])
Personal_Loan          0    1   All
Securities_Account                 
0                   4058  420  4478
1                    462   60   522
All                 4520  480  5000
------------------------------------------------------------------------------------------------------------------------
No description has been provided for this image

Around 90% of customer who have securities account are not accepting personal loan; however, only 10% of the customers are seem to accept the personal loan.

9. Personal Loan Vs CD Account¶

In [ ]:
stacked_plot(df["CD_Account"])
ax = sns.countplot(data=df, x="CD_Account", hue="Personal_Loan")
perc_on_bar(ax, df["CD_Account"])
Personal_Loan     0    1   All
CD_Account                    
0              4358  340  4698
1               162  140   302
All            4520  480  5000
------------------------------------------------------------------------------------------------------------------------
No description has been provided for this image
  • There are only 140 customers were accepted the loan out of 480 customers.

10. PERSONAL LOAN VS ONLINE¶

In [ ]:
stacked_plot(df["Online"])
ax = sns.countplot(data=df, x="Online", hue="Personal_Loan")
perc_on_bar(ax, df["Online"])
Personal_Loan     0    1   All
Online                        
0              1827  189  2016
1              2693  291  2984
All            4520  480  5000
------------------------------------------------------------------------------------------------------------------------
No description has been provided for this image
  1. Found that only 291 of the customers have accepted the personal loan out of 2984 customers who avail the internet banking facilities.

11. Personal Loan VS Credit Card¶

In [ ]:
stacked_plot(df["CreditCard"])
ax = sns.countplot(data=df, x="CreditCard", hue="Personal_Loan")
perc_on_bar(ax, df["CreditCard"])
Personal_Loan     0    1   All
CreditCard                    
0              3193  337  3530
1              1327  143  1470
All            4520  480  5000
------------------------------------------------------------------------------------------------------------------------
No description has been provided for this image
  1. It is showing that 1470 customers who use a credit card issued by other bank,143 of them have a pesonal loan with AllLife Bank.

12. Personal Loan VS County¶

In [ ]:
pd.crosstab(df["County"], df["Personal_Loan"]).plot(kind="bar", stacked=True, figsize=(18, 6))
Out[ ]:
<Axes: xlabel='County'>
No description has been provided for this image
  • It is remarkable that the Los Angeles County has the maximum number of customer who borrowed a personal loan.
  • Santa Clara and San Diego counties are leading in numbers after Los Angeles county.
  • Personal loan values for imperial county and Napa county need to be checked.

Multivariate Analysis¶

In [ ]:
print(df.columns)
Index(['ID', 'Age', 'Experience', 'Income', 'ZIPCode', 'Family', 'CCAvg',
       'Education', 'Mortgage', 'Personal_Loan', 'Securities_Account',
       'CD_Account', 'Online', 'CreditCard', 'County'],
      dtype='object')
In [ ]:
plt.figure(figsize=(16, 7))
#convert categorical data into numerical data
numerical_features = ['Age', 'Experience', 'Income', 'CCAvg', 'Mortgage']
sns.pairplot(df[numerical_features])
plt.show()

import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

# Select only numerical features for correlation analysis
numerical_df = df.select_dtypes(include=np.number)

plt.figure(figsize=(16, 7))
# Generate heatmap using the numerical features
ax = sns.heatmap(numerical_df.corr(), vmin=-1, vmax=1, annot=True, fmt=".2f", cmap="Spectral")
plt.show()
<Figure size 1600x700 with 0 Axes>
No description has been provided for this image
No description has been provided for this image
  1. Found that age and experience have a very high correlation of 0.99
  2. Income and CCAvg ahve a ositive correlation of 0.65
  3. Personal loan and CD Account and securities account and CD Account have an equal number of correlation of 0.32.
  4. Personal loan and CCAvg have a correlation of 0.37
  5. Personal loan and income have the correlation of 0.50
  6. The least correlation of 0.28 is showing between CD account and creditcard.
In [ ]:
sns.pairplot(df, hue="Personal_Loan", corner = True)
plt.show()
No description has been provided for this image
  1. It is clear that the customers with higher income, CCAvg and Mortgage has borrowed a personal loan.
In [ ]:
col1 = df[["Age", "Experience", "Income", "CCAvg", "Mortgage", "Personal_Loan"]].columns.to_list()
col1.remove("Personal_Loan")
print(col1)
plt.figure(figsize=(16, 7))

for i, variable in enumerate(col1):
    plt.subplot(3, 3, i + 1)
    sns.boxplot(x="Personal_Loan", y=variable, data=df, showfliers = False)
    plt.title(f'Boxplot of {variable}')
    plt.xlabel(variable)
    plt.show()
['Age', 'Experience', 'Income', 'CCAvg', 'Mortgage']
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
  1. Based on these boxplots, it is prove that the observation from the pairplot seems correct.
  2. Customers with higher income, CCAvg and Mortgage have borrowed personal loan.
  3. Age and Experience show similar distribution as expected as have high positive correlation.

Data Preprocessing¶

  • Missing value treatment
  • Feature engineering (if needed)
  • Outlier detection and treatment (if needed)
  • Preparing data for modeling
  • Any other preprocessing steps (if needed)
In [ ]:
#Missing value treatment
df.isnull().sum()
Out[ ]:
0
ID 0
Age 0
Experience 0
Income 0
ZIPCode 0
Family 0
CCAvg 0
Education 0
Mortgage 0
Personal_Loan 0
Securities_Account 0
CD_Account 0
Online 0
CreditCard 0
County 34

There is no missing value in this dataframe.

In [ ]:
#checking feature engineering need in a data frame
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 15 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   ID                  5000 non-null   int64  
 1   Age                 5000 non-null   int64  
 2   Experience          5000 non-null   int64  
 3   Income              5000 non-null   int64  
 4   ZIPCode             5000 non-null   int64  
 5   Family              5000 non-null   int64  
 6   CCAvg               5000 non-null   float64
 7   Education           5000 non-null   int64  
 8   Mortgage            5000 non-null   int64  
 9   Personal_Loan       5000 non-null   int64  
 10  Securities_Account  5000 non-null   int64  
 11  CD_Account          5000 non-null   int64  
 12  Online              5000 non-null   int64  
 13  CreditCard          5000 non-null   int64  
 14  County              4966 non-null   object 
dtypes: float64(1), int64(13), object(1)
memory usage: 586.1+ KB

The memory usage of this data is 586.1KB. It contains 5000 rows and 15 columns.

In [ ]:
df["County"]. value_counts()
Out[ ]:
count
County
Los Angeles County 1095
San Diego County 568
Santa Clara County 563
Alameda County 500
Orange County 339
San Francisco County 257
San Mateo County 204
Sacramento County 184
Santa Barbara County 154
Yolo County 130
Monterey County 128
Ventura County 114
San Bernardino County 101
Contra Costa County 85
Santa Cruz County 68
Riverside County 56
Marin County 54
Kern County 54
San Luis Obispo County 33
Solano County 33
Humboldt County 32
Sonoma County 28
Fresno County 26
Placer County 24
Butte County 19
Shasta County 18
El Dorado County 17
Stanislaus County 15
San Benito County 14
San Joaquin County 13
Mendocino County 8
Siskiyou County 7
Tuolumne County 7
Merced County 4
Trinity County 4
Lake County 4
Imperial County 3
Napa County 3

In [ ]:
#Grouping counties under 40 observations into 'other county'
county = df["County"].value_counts().tail(20).keys()
df["County"] = np.where(df["County"].isin(county), "Other County", df["County"])
In [ ]:
df["County"]. value_counts()
Out[ ]:
count
County
Los Angeles County 1095
San Diego County 568
Santa Clara County 563
Alameda County 500
Orange County 339
Other County 312
San Francisco County 257
San Mateo County 204
Sacramento County 184
Santa Barbara County 154
Yolo County 130
Monterey County 128
Ventura County 114
San Bernardino County 101
Contra Costa County 85
Santa Cruz County 68
Riverside County 56
Marin County 54
Kern County 54

In [ ]:
df["Personal_Loan"].value_counts()
Out[ ]:
count
Personal_Loan
0 4520
1 480

In [ ]:
df["Securities_Account"].value_counts()
Out[ ]:
count
Securities_Account
0 4478
1 522

In [ ]:
df["CD_Account"].value_counts()
Out[ ]:
count
CD_Account
0 4698
1 302

In [ ]:
df["Online"].value_counts()
Out[ ]:
count
Online
1 2984
0 2016

In [ ]:
df["CreditCard"].value_counts()
Out[ ]:
count
CreditCard
0 3530
1 1470

In [ ]:
df["Mortgage"].value_counts()
Out[ ]:
count
Mortgage
0 3462
98 17
119 16
89 16
91 16
103 16
83 16
102 15
90 15
78 15
87 14
94 14
131 14
118 14
104 14
101 14
144 13
106 13
81 13
116 13
109 13
112 13
97 12
120 12
100 12
76 12
185 12
121 12
84 11
137 11
79 11
115 11
153 11
95 11
86 11
111 11
158 11
184 11
113 10
108 10
142 10
117 10
82 10
151 10
161 10
135 10
148 9
149 9
166 9
88 9
221 9
146 9
123 9
167 9
128 9
159 9
138 8
205 8
147 8
174 8
218 8
122 8
75 8
140 8
170 8
194 8
110 8
114 8
169 8
157 8
164 8
105 8
207 8
132 8
96 7
127 7
124 7
229 7
204 7
230 7
85 7
249 7
219 7
80 7
126 7
196 7
129 7
154 7
193 7
130 6
187 6
145 6
236 6
192 6
134 6
136 6
150 6
240 6
251 6
155 6
163 6
182 6
99 6
141 6
107 6
125 6
198 5
272 5
227 5
188 5
307 5
180 5
301 5
220 5
171 5
172 5
212 5
209 5
256 5
178 5
217 5
294 5
156 5
92 4
245 4
162 4
342 4
232 4
93 4
189 4
231 4
297 4
200 4
199 4
241 4
181 4
247 4
211 4
77 4
224 4
175 4
179 4
203 4
152 4
190 4
139 4
244 4
325 3
366 3
226 3
315 3
329 3
133 3
282 3
422 3
215 3
223 3
255 3
197 3
228 3
270 3
323 3
143 3
242 3
268 3
277 3
168 3
186 3
257 3
352 3
248 3
264 3
380 3
319 3
341 3
428 3
238 3
275 3
176 3
263 3
260 3
333 3
177 3
305 3
208 3
214 3
285 3
308 3
310 3
328 3
213 3
327 3
239 3
216 3
160 3
165 3
400 3
243 2
276 2
330 2
266 2
306 2
359 2
565 2
299 2
449 2
313 2
225 2
309 2
267 2
246 2
202 2
427 2
292 2
293 2
368 2
358 2
312 2
262 2
408 2
183 2
364 2
455 2
394 2
322 2
252 2
287 2
442 2
233 2
437 2
392 2
372 2
314 2
234 2
289 2
402 2
250 2
259 2
304 2
303 2
222 2
280 2
271 2
201 2
354 2
357 2
397 2
296 1
416 1
344 1
485 1
383 1
373 1
403 1
553 1
210 1
321 1
508 1
286 1
500 1
429 1
265 1
206 1
378 1
431 1
195 1
410 1
273 1
590 1
612 1
283 1
550 1
581 1
571 1
405 1
302 1
577 1
281 1
481 1
173 1
355 1
509 1
464 1
278 1
587 1
353 1
351 1
258 1
253 1
345 1
300 1
381 1
415 1
461 1
467 1
496 1
522 1
475 1
343 1
398 1
466 1
535 1
419 1
589 1
360 1
389 1
334 1
336 1
284 1
337 1
567 1
601 1
382 1
617 1
433 1
421 1
483 1
391 1
237 1
318 1
311 1
446 1
290 1
406 1
385 1
635 1
235 1
295 1
331 1
582 1
396 1
477 1
432 1
452 1
524 1
412 1
191 1
470 1
374 1
569 1
298 1
326 1
547 1
458 1
505 1
361 1
541 1

In [ ]:
df.groupby("Personal_Loan")["Mortgage"].value_counts()
Out[ ]:
count
Personal_Loan Mortgage
0 0 3150
98 17
83 16
103 16
89 16
102 15
91 15
90 15
119 15
78 15
131 14
87 14
104 14
101 14
94 13
116 13
112 13
118 13
109 13
144 12
100 12
81 12
185 12
95 11
121 11
79 11
76 11
137 11
153 11
106 11
97 11
120 10
117 10
135 10
113 10
84 10
108 10
151 10
158 10
86 9
128 9
82 9
184 9
142 9
161 9
111 9
159 9
166 9
115 9
148 9
146 8
157 8
174 8
167 8
218 8
207 8
169 8
164 8
88 8
105 8
123 8
122 8
110 8
114 8
132 7
221 7
80 7
75 7
96 7
196 7
170 7
230 7
154 7
205 7
149 7
147 7
219 7
85 7
194 7
140 7
126 7
138 7
127 7
129 7
193 7
124 7
187 6
182 6
204 6
236 6
136 6
145 6
107 6
125 6
130 6
155 6
141 6
249 6
220 5
192 5
256 5
217 5
188 5
134 5
163 5
240 5
180 5
172 5
178 5
156 5
229 5
150 5
241 4
272 4
245 4
232 4
212 4
247 4
227 4
224 4
199 4
251 4
200 4
244 4
99 4
171 4
181 4
189 4
139 4
190 4
152 4
93 4
92 4
162 4
77 4
179 4
160 3
248 3
268 3
242 3
143 3
133 3
257 3
263 3
264 3
198 3
270 3
239 3
275 3
285 3
294 3
297 3
307 3
328 3
333 3
341 3
366 3
165 3
323 3
238 3
175 3
208 3
223 3
177 3
168 3
226 3
209 3
228 3
211 3
186 3
203 3
214 3
215 3
231 3
310 2
213 2
306 2
325 2
201 2
322 2
319 2
315 2
308 2
314 2
313 2
329 2
330 2
408 2
402 2
455 2
397 2
394 2
392 2
380 2
352 2
202 2
303 2
305 2
309 2
262 2
260 2
271 2
222 2
267 2
225 2
176 2
277 2
255 2
252 2
250 2
233 2
243 2
276 2
197 2
216 2
293 2
289 2
287 2
280 2
311 1
355 1
326 1
327 1
331 1
334 1
336 1
337 1
342 1
344 1
345 1
353 1
354 1
357 1
398 1
358 1
359 1
360 1
361 1
364 1
368 1
381 1
382 1
383 1
385 1
389 1
321 1
312 1
304 1
302 1
173 1
183 1
191 1
195 1
206 1
234 1
235 1
246 1
258 1
259 1
265 1
266 1
273 1
278 1
283 1
284 1
286 1
290 1
292 1
296 1
298 1
299 1
300 1
396 1
400 1
565 1
601 1
635 1
403 1
553 1
550 1
524 1
509 1
405 1
406 1
415 1
419 1
422 1
427 1
428 1
429 1
433 1
437 1
442 1
449 1
452 1
458 1
461 1
464 1
470 1
475 1
481 1
496 1
500 1
505 1
508 1
1 0 312
301 5
342 3
282 3
422 2
428 2
86 2
221 2
307 2
294 2
327 2
251 2
372 2
229 2
400 2
184 2
209 2
149 2
198 2
106 2
115 2
99 2
111 2
120 2
295 1
167 1
266 1
272 1
277 1
281 1
304 1
175 1
205 1
299 1
171 1
297 1
211 1
170 1
292 1
260 1
253 1
259 1
255 1
197 1
194 1
213 1
203 1
216 1
227 1
192 1
204 1
231 1
234 1
237 1
183 1
240 1
246 1
249 1
176 1
212 1
210 1
374 1
305 1
522 1
431 1
432 1
437 1
442 1
446 1
449 1
466 1
467 1
477 1
483 1
485 1
535 1
421 1
541 1
590 1
547 1
565 1
567 1
569 1
612 1
571 1
161 1
581 1
577 1
427 1
416 1
308 1
354 1
310 1
312 1
315 1
318 1
319 1
325 1
329 1
587 1
343 1
351 1
352 1
582 1
412 1
357 1
358 1
359 1
364 1
368 1
373 1
378 1
380 1
391 1
589 1
410 1
163 1
158 1
150 1
97 1
75 1
76 1
81 1
82 1
84 1
88 1
91 1
94 1
118 1
119 1
121 1
123 1
132 1
134 1
138 1
140 1
142 1
144 1
146 1
147 1
617 1

  1. Mortgage can be restructed as Yes/No for better analysis.
  2. It is also evident that personal loan is '0' for all mortgage '0' values.
  3. We have seen that mortgage is highly skewed towards '0'
In [ ]:
#user defined function to convert mortgage to a boolean data
def mort(val):
    if val == 0:
        return "No"
    else:
        return "Yes"
Mortgage = df["Mortgage"].apply(mort)
Mortgage.value_counts()
df["Mortgage"] = df["Mortgage"].apply(mort)
df["Mortgage"].value_counts()
Out[ ]:
count
Mortgage
No 3462
Yes 1538

In [ ]:
#Analyse the customer have mortgage to accept personal loan by yes or no
df.groupby("Mortgage")["Personal_Loan"].value_counts()
Out[ ]:
count
Mortgage Personal_Loan
No 0 3150
1 312
Yes 0 1370
1 168

There are 480 customers have chance to take personal loan out of 5000 customers with the bank.

In [ ]:
df_dtree = df.copy()
df_dtree.head()
Out[ ]:
ID Age Experience Income ZIPCode Family CCAvg Education Mortgage Personal_Loan Securities_Account CD_Account Online CreditCard County
0 1 25 1 49 91107 4 1.6 1 No 0 1 0 0 0 Los Angeles County
1 2 45 19 34 90089 3 1.5 1 No 0 1 0 0 0 Los Angeles County
2 3 39 15 11 94720 1 1.0 1 No 0 0 0 0 0 Alameda County
3 4 35 9 100 94112 1 2.7 2 No 0 0 0 0 0 San Francisco County
4 5 35 8 45 91330 4 1.0 2 No 0 0 0 0 1 Los Angeles County

Outlier treatment¶

In [ ]:
#creating a list of numerical columns
numerical_features = ['Age', 'Experience', 'Income', 'CCAvg', 'Mortgage']
In [ ]:
#Boxplots of numerical columns to view the outliers
plt.figure(figsize=(15, 10))
for i, feature in enumerate(numerical_features):
    plt.subplot(3, 3, i + 1)
    sns.boxplot(x=df[feature])
    plt.title(f'Boxplot of {feature}')
    plt.xlabel(feature)
    plt.tight_layout()
    plt.show()
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
  • The outliers are present in income,CCAvg and mortgage.
In [ ]:
#Creating a list of numerical columns

numerical_features = ['Age', 'Experience', 'Income', 'CCAvg', 'Mortgage']
# User defined function to floor and cap the outliers



def treat_outliers(df, col):
    """
    treats outliers in a variable
    col: str, name of the numerical variable
    df: dataframe
    col: name of the column
    """

    Q1 = df[col].quantile(0.25)  # 1st quantile
    Q3 = df[col].quantile(0.75)  # 3rd quantile
    IQR = Q3 - Q1
    Lower_whisker = Q1 - 1.5 * IQR
    Upper_whisker = Q3 + 1.5 * IQR

    # all the values smaller than Lower_Whisker will be assigned the value of Lower_Whisker
    # all the values greater than Upper_Whisker will be assigned the value of Upper_Whisker

    df[col] = np.clip(df[col], Lower_whisker, Upper_whisker)

    return df
def treat_outliers_all(df, col_list):

  """
    treat outlier in all numerical variables
    col_list: list of numerical variables
    df: data frame
    """

  for c in col_list:
        df = treat_outliers(df, c)
  return df

# Assuming 'Mortgage_Yes/No' is the correct column name
# Convert 'Yes'/'No' to 1/0 and create a new 'Mortgage' column if it doesn't exist.
# If 'Mortgage' already exists (e.g., as a numeric type), this line won't change it.
df['Mortgage'] = df['Mortgage'].map({'Yes': 1, 'No': 0}).astype(float)
In [ ]:
#Creating a list of numerical columns
numerical_features = ['Age', 'Experience', 'Income', 'CCAvg']


df['Mortgage'] = df['Mortgage'].map({'Yes': 1, 'No': 0}).astype(float)

#Treating the outliers for numerical features ONLY
df = treat_outliers_all(df, numerical_features)
In [ ]:
#Plotting the boxplot to check if the outliers are fixed

plt.figure(figsize=(15, 10))


for i, variable in enumerate(col1):
    plt.subplot(2, 3, i + 1)
    plt.boxplot(df[variable], whis=1.5)
    plt.tight_layout()
    plt.title(variable)

plt.show()
No description has been provided for this image

There is no outliers are present; it has been fixed now.

In [ ]:
#Categorizing datatypes

df["County"] = df["County"].astype("category")
df["Mortgage"] = df["Mortgage"].astype("category")
df["Education"] = df["Education"].astype("category")
In [ ]:
df.info()
  • The data processing has been done.
  • Next procedure is to drop the unwanted columns before proceeding with model building
In [ ]:
#dropping unwanted columns
df.drop([ 'Mortgage'], axis = 1, inplace=True)
# Corrected the column name to 'Mortgage_Yes/No'
In [ ]:
#Final dataset
df.head()
Out[ ]:
ID Age Experience Income ZIPCode Family CCAvg Education Personal_Loan Securities_Account CD_Account Online CreditCard County
0 1 25 1 49.0 91107 4 1.6 1 0 1 0 0 0 Los Angeles County
1 2 45 19 34.0 90089 3 1.5 1 0 1 0 0 0 Los Angeles County
2 3 39 15 11.0 94720 1 1.0 1 0 0 0 0 0 Alameda County
3 4 35 9 100.0 94112 1 2.7 2 0 0 0 0 0 San Francisco County
4 5 35 8 45.0 91330 4 1.0 2 0 0 0 0 1 Los Angeles County
In [ ]:
#Making a copy of the data for logistic regression model building
df_log = df.copy()

Key insights from EDA

The results shows from the EDA are maximum number of customers who opted for a personal loan are the following: *They are between the ages of 35 and 55.

  • Have a minimum experience of 9 years.
  • Have a minimum annual average income above 120000 dollars.
  • Have a fsmily size of 3+.
  • They own a CD account with the bank
  • Have an education level of two and above.
  • Spend an average of $25K on creditcards every month
  • They do have a mortgage.
  • Avail online banking facilities
  • These customers are from the Los Angeles,San Diego or Santa clara counties.

Model Building¶

Model Building - Logistic Regression

In [ ]:
# defining the independent variable
X = df.drop(columns=['ID', 'Personal_Loan'])

df_log.head()
Out[ ]:
ID Age Experience Income ZIPCode Family CCAvg Education Personal_Loan Securities_Account CD_Account Online CreditCard County
0 1 25 1 49.0 91107 4 1.6 1 0 1 0 0 0 Los Angeles County
1 2 45 19 34.0 90089 3 1.5 1 0 1 0 0 0 Los Angeles County
2 3 39 15 11.0 94720 1 1.0 1 0 0 0 0 0 Alameda County
3 4 35 9 100.0 94112 1 2.7 2 0 0 0 0 0 San Francisco County
4 5 35 8 45.0 91330 4 1.0 2 0 0 0 0 1 Los Angeles County

This data shows that the ages between 25-45 are more likely to take personal loan.

In [ ]:
def perc_on_bar(ax, df):
    """

       drop the unwanted columns before
       proceed with model building
    """

    for p in ax.patches:
        width = p.get_width()
        height = p.get_height()
        x, y = p.get_xy()

X = df.drop(columns=['ID', 'Personal_Loan'])
y = df['Personal_Loan']
In [ ]:
# fitting the model to the training data
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
import pandas as pd # You likely already have this imported, but it's best to be explicit

#Creating dummy variable for categorical features
X = pd.get_dummies(X, drop_first=True)

#convert categorical data into numerical data
X=pd.get_dummies(X, columns=X.filter(items=["Age", "Experience", "Income", "CC_Avg", "Zipcode", "Family", "Mortgage", "Education"]).columns.tolist(), drop_first=True)
X.head()

#splitting data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
In [ ]:
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)
print("\n Percentage of clases in training set :")
print(y_train.value_counts(normalize=True))
print("\n Percentage of clases in test set :")
print(y_test.value_counts(normalize=True))
(3500, 260)
(1500, 260)
(3500,)
(1500,)

 Percentage of clases in training set :
Personal_Loan
0    0.907714
1    0.092286
Name: proportion, dtype: float64

 Percentage of clases in test set :
Personal_Loan
0    0.895333
1    0.104667
Name: proportion, dtype: float64

We have a well balanced classes of the dependant variable in both train and test sets.

In [ ]:
#Creating training and test sets
logistic_regression_model = LogisticRegression()
logistic_regression_model.fit(X_train, y_train)
df_log.head()
X = df.drop(columns=['ID', 'Personal_Loan'])
y = df['Personal_Loan']

#creating dummy variables for categorical features
X = pd.get_dummies(X, drop_first=True)

#splitting data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

Model Evaluation Criterion¶

  • *** Model can make wrong prediction such as***: 1. Predicting a customer will not borrow the personal loan but in reality the customer will borrow the loan. 2. Predicting a customer will borrow the personal loan but in reality the customer will not borrow the loan. The concerning prediction is the first prediction. AllLife Bank is looking to expand its loan business, every single customer who will borrow the personal loan is very important. Therefore, the mistakes in the first prediction has to be considerably low.

How to resolve false Negatives? We can maximized the Recall score inorder to resolve false negatives. greater the Recall score higher the chances of predicting the potential customers who will borrow the personal loan.

Model Building¶

In [ ]:
# defining a function to compute different metrics to check performance of a classification model built using sklearn
def model_performance_classification(model, predictors, target):
    """
    Function to compute different metrics to check classification model performance

    model: classifier
    predictors: independent variables
    target: dependent variable
    """

    # predicting using the independent variables
    pred = model.predict(predictors)

    acc = accuracy_score(target, pred)  # to compute Accuracy
    recall = recall_score(target, pred)  # to compute Recall
    precision = precision_score(target, pred)  # to compute Precision
    f1 = f1_score(target, pred)  # to compute F1-score

    # creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {"Accuracy": acc, "Recall": recall, "Precision": precision, "F1": f1,},
        index=[0],
    )

    return df_perf
In [ ]:
def confusion_matrix_sklearn(model, predictors, target):
    """
    To plot the confusion_matrix with percentages

    model: classifier
    predictors: independent variables
    target: dependent variable
    """
    y_pred = model.predict(predictors)
    cm = confusion_matrix(target, y_pred)
    labels = np.asarray(
        [
            ["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
            for item in cm.flatten()
        ]
    ).reshape((2, 2))

    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=labels, fmt="")
    plt.ylabel("True label")
    plt.xlabel("Predicted label")

Logistic Regression with sklearn library¶

In [ ]:
#Using the newtown-cg solver as it is faster for high-dimential data

import warnings
warnings.filterwarnings('ignore')

lg = LogisticRegression(solver='newton-cg')
lg.fit(X_train, y_train)
Out[ ]:
LogisticRegression(solver='newton-cg')
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
LogisticRegression(solver='newton-cg')

Model performance on training set

In [ ]:
#Predicting on training set
y_pred_train = lg.predict(X_train)
print("Training set performance:\n")
print(model_performance_classification(lg, X_train, y_train))
print(confusion_matrix_sklearn(lg, X_train, y_train))
print("Recall: ", recall_score(y_train, y_pred_train))
print("Accuracy: ", accuracy_score(y_train, y_pred_train))
print("Precision: ", precision_score(y_train, y_pred_train))
print("F1: ", f1_score(y_train, y_pred_train))
Training set performance:

   Accuracy    Recall  Precision        F1
0     0.962  0.671827   0.889344  0.765432
None
Recall:  0.6718266253869969
Accuracy:  0.962
Precision:  0.889344262295082
F1:  0.7654320987654321
No description has been provided for this image
In [ ]:
print("Training set performance:\n")
print(model_performance_classification(lg, X_train, y_train))
print(confusion_matrix_sklearn(lg, X_train, y_train))
print("Recall: ", recall_score(y_train, y_pred_train))
print("Accuracy: ", accuracy_score(y_train, y_pred_train))
print("Precision: ", precision_score(y_train, y_pred_train))
print("F1: ", f1_score(y_train, y_pred_train))
Training set performance:

   Accuracy    Recall  Precision        F1
0     0.962  0.671827   0.889344  0.765432
None
Recall:  0.6718266253869969
Accuracy:  0.962
Precision:  0.889344262295082
F1:  0.7654320987654321
No description has been provided for this image
In [ ]:
print("Training set performance:\n")
print(model_performance_classification(lg, X_train, y_train))
print(confusion_matrix_sklearn(lg, X_train, y_train))
print("Recall: ", recall_score(y_train, y_pred_train))
print("Accuracy: ", accuracy_score(y_train, y_pred_train))
print("Precision: ", precision_score(y_train, y_pred_train))
print("F1: ", f1_score(y_train, y_pred_train))
Training set performance:

   Accuracy    Recall  Precision        F1
0     0.962  0.671827   0.889344  0.765432
None
Recall:  0.6718266253869969
Accuracy:  0.962
Precision:  0.889344262295082
F1:  0.7654320987654321
No description has been provided for this image
In [ ]:
print("Training set performance:\n")
print(model_performance_classification(lg, X_train, y_train))
print(confusion_matrix_sklearn(lg, X_train, y_train))
print("Recall: ", recall_score(y_train, y_pred_train))
print("Accuracy: ", accuracy_score(y_train, y_pred_train))
print("Precision: ", precision_score(y_train, y_pred_train))
print("F1: ", f1_score(y_train, y_pred_train))
Training set performance:

   Accuracy    Recall  Precision        F1
0     0.962  0.671827   0.889344  0.765432
None
Recall:  0.6718266253869969
Accuracy:  0.962
Precision:  0.889344262295082
F1:  0.7654320987654321
No description has been provided for this image

Model performance on test set

In [ ]:
#Predicting on the test set
y_pred_test = lg.predict(X_test)
In [ ]:
print("Test set performance:\n")
print(model_performance_classification(lg, X_test, y_test))
print(confusion_matrix_sklearn(lg, X_test, y_test))
print("Recall: ", recall_score(y_test, y_pred_test))
print("Accuracy: ", accuracy_score(y_test, y_pred_test))
print("Precision: ", precision_score(y_test, y_pred_test))
print("F1: ", f1_score(y_test, y_pred_test))
Test set performance:

   Accuracy    Recall  Precision        F1
0     0.962  0.713376   0.903226  0.797153
None
Recall:  0.7133757961783439
Accuracy:  0.962
Precision:  0.9032258064516129
F1:  0.797153024911032
No description has been provided for this image

Observations

  1. The training and testing recall scores are 0.616 and 0.62; same scores.
  2. The metric scores are looking good
  3. however,for checking the statstical validity of the model, we must buid a model using the statsmodels library
  4. Accuracy in the test model is only 1% lower than the training model.

Logistic Regression with statsmodels library

In [ ]:
# Importing necessary libraries
import statsmodels.api as sm
import pandas as pd
from sklearn.model_selection import train_test_split


# Assuming 'df' is your original DataFrame containing the data
# If it's named something else, replace 'df' with the actual name
df_logit = df.copy()  # Creating a copy of the original DataFrame and assigning it to df_logit

logistic_regression = sm.Logit(y_train, X_train.astype(float))
X = df_logit.drop("Personal_Loan", axis=1)
y = df_logit["Personal_Loan"]

X = pd.get_dummies(X, drop_first=True)

# adding constant
X = sm.add_constant(X)

# Splitting data in train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42
)
In [ ]:
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
import pandas as pd
from sklearn.model_selection import train_test_split

def logistic_regression(X, y):
    # Ensure all columns in X are numeric
    X = X.select_dtypes(include=['number'])  # Select only numeric columns

    # sm.Logit expects y to be a 1D array, not a Series
    y = y.values

    logistic_regression_model = sm.Logit(y, X)
    result = logistic_regression_model.fit()
    return result

# Assuming X and y are your original DataFrames

# Convert categorical columns in X to numeric using one-hot encoding
X = pd.get_dummies(X, drop_first=True)

# Add a constant to the independent variables
X = sm.add_constant(X)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

# Now call  logistic regression function
result = logistic_regression(X_train, y_train)
print(result.summary())
Optimization terminated successfully.
         Current function value: 0.153085
         Iterations 9
                           Logit Regression Results                           
==============================================================================
Dep. Variable:                      y   No. Observations:                 4000
Model:                          Logit   Df Residuals:                     3988
Method:                           MLE   Df Model:                           11
Date:                Mon, 30 Sep 2024   Pseudo R-squ.:                  0.5080
Time:                        04:16:37   Log-Likelihood:                -612.34
converged:                       True   LL-Null:                       -1244.5
Covariance Type:            nonrobust   LLR p-value:                2.170e-264
======================================================================================
                         coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------------
const                 -9.9667      4.192     -2.377      0.017     -18.183      -1.750
ID                 -3.403e-05   5.35e-05     -0.636      0.525      -0.000    7.08e-05
Age                    0.2023      0.060      3.354      0.001       0.084       0.320
Experience            -0.1917      0.060     -3.181      0.001      -0.310      -0.074
Income                 0.0430      0.002     18.553      0.000       0.038       0.048
ZIPCode            -4.953e-05   4.18e-05     -1.186      0.236      -0.000    3.23e-05
Family                 0.9011      0.081     11.082      0.000       0.742       1.060
CCAvg                  0.2176      0.051      4.260      0.000       0.117       0.318
Securities_Account    -0.8885      0.289     -3.072      0.002      -1.455      -0.322
CD_Account             3.6631      0.320     11.449      0.000       3.036       4.290
Online                -0.5494      0.161     -3.423      0.001      -0.864      -0.235
CreditCard            -1.2038      0.214     -5.626      0.000      -1.623      -0.784
======================================================================================
In [ ]:
#Fitting logistic regression model
logistic_regression = sm.Logit(y_train, X_train.astype(float))
logit = sm.Logit(y_train, X_train.astype(float))
result = logit.fit()
lg = logistic_regression.fit(disp=False)
print(result.summary())
Optimization terminated successfully.
         Current function value: 0.111998
         Iterations 9
                           Logit Regression Results                           
==============================================================================
Dep. Variable:          Personal_Loan   No. Observations:                 4000
Model:                          Logit   Df Residuals:                     3968
Method:                           MLE   Df Model:                           31
Date:                Mon, 30 Sep 2024   Pseudo R-squ.:                  0.6400
Time:                        04:16:41   Log-Likelihood:                -447.99
converged:                       True   LL-Null:                       -1244.5
Covariance Type:            nonrobust   LLR p-value:                     0.000
================================================================================================
                                   coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------------------------
const                           -9.5914     16.781     -0.572      0.568     -42.481      23.298
ID                           -6.751e-05   6.26e-05     -1.079      0.281      -0.000    5.51e-05
Age                             -0.0269      0.072     -0.373      0.709      -0.168       0.115
Experience                       0.0390      0.072      0.542      0.588      -0.102       0.180
Income                           0.0617      0.004     17.574      0.000       0.055       0.069
ZIPCode                      -4.284e-05      0.000     -0.242      0.809      -0.000       0.000
Family                           0.7336      0.090      8.122      0.000       0.557       0.911
CCAvg                            0.3777      0.067      5.646      0.000       0.247       0.509
Securities_Account              -0.8681      0.345     -2.515      0.012      -1.545      -0.192
CD_Account                       3.9798      0.398      9.997      0.000       3.200       4.760
Online                          -0.7646      0.193     -3.969      0.000      -1.142      -0.387
CreditCard                      -1.2444      0.248     -5.022      0.000      -1.730      -0.759
Education_2                      3.8901      0.308     12.611      0.000       3.286       4.495
Education_3                      3.9648      0.302     13.144      0.000       3.374       4.556
County_Contra Costa County       0.4712      0.761      0.619      0.536      -1.021       1.963
County_Kern County               1.0435      0.777      1.343      0.179      -0.480       2.567
County_Los Angeles County        0.4190      0.762      0.550      0.583      -1.075       1.913
County_Marin County              0.6331      0.962      0.658      0.510      -1.253       2.519
County_Monterey County           0.6314      0.626      1.009      0.313      -0.595       1.858
County_Orange County             0.2520      0.572      0.441      0.660      -0.869       1.373
County_Other County              0.4415      0.507      0.871      0.384      -0.553       1.436
County_Riverside County          1.5687      0.819      1.916      0.055      -0.036       3.173
County_Sacramento County         0.2812      0.628      0.448      0.654      -0.950       1.512
County_San Bernardino County    -0.3703      0.854     -0.434      0.665      -2.044       1.303
County_San Diego County          0.3633      0.591      0.615      0.539      -0.795       1.522
County_San Francisco County      0.3970      0.522      0.761      0.447      -0.625       1.419
County_San Mateo County         -1.1638      0.637     -1.826      0.068      -2.413       0.086
County_Santa Barbara County      0.4375      0.675      0.648      0.517      -0.886       1.761
County_Santa Clara County        0.6041      0.386      1.565      0.118      -0.152       1.361
County_Santa Cruz County        -0.1973      0.788     -0.250      0.802      -1.742       1.347
County_Ventura County            0.5789      0.748      0.773      0.439      -0.888       2.046
County_Yolo County               0.0763      0.649      0.118      0.906      -1.196       1.349
================================================================================================

Possibly complete quasi-separation: A fraction 0.13 of observations can be
perfectly predicted. This might indicate that there is complete
quasi-separation. In this case some parameters will not be identified.
In [ ]:
#Fitting logistic regresaion model
logistic_regression = sm.Logit(y_train, X_train.astype(float))
logit = sm.Logit(y_train, X_train.astype(float))
result = logit.fit()
lg = logistic_regression.fit(disp=False)
print(result.summary())
Optimization terminated successfully.
         Current function value: 0.111998
         Iterations 9
                           Logit Regression Results                           
==============================================================================
Dep. Variable:          Personal_Loan   No. Observations:                 4000
Model:                          Logit   Df Residuals:                     3968
Method:                           MLE   Df Model:                           31
Date:                Mon, 30 Sep 2024   Pseudo R-squ.:                  0.6400
Time:                        04:16:44   Log-Likelihood:                -447.99
converged:                       True   LL-Null:                       -1244.5
Covariance Type:            nonrobust   LLR p-value:                     0.000
================================================================================================
                                   coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------------------------
const                           -9.5914     16.781     -0.572      0.568     -42.481      23.298
ID                           -6.751e-05   6.26e-05     -1.079      0.281      -0.000    5.51e-05
Age                             -0.0269      0.072     -0.373      0.709      -0.168       0.115
Experience                       0.0390      0.072      0.542      0.588      -0.102       0.180
Income                           0.0617      0.004     17.574      0.000       0.055       0.069
ZIPCode                      -4.284e-05      0.000     -0.242      0.809      -0.000       0.000
Family                           0.7336      0.090      8.122      0.000       0.557       0.911
CCAvg                            0.3777      0.067      5.646      0.000       0.247       0.509
Securities_Account              -0.8681      0.345     -2.515      0.012      -1.545      -0.192
CD_Account                       3.9798      0.398      9.997      0.000       3.200       4.760
Online                          -0.7646      0.193     -3.969      0.000      -1.142      -0.387
CreditCard                      -1.2444      0.248     -5.022      0.000      -1.730      -0.759
Education_2                      3.8901      0.308     12.611      0.000       3.286       4.495
Education_3                      3.9648      0.302     13.144      0.000       3.374       4.556
County_Contra Costa County       0.4712      0.761      0.619      0.536      -1.021       1.963
County_Kern County               1.0435      0.777      1.343      0.179      -0.480       2.567
County_Los Angeles County        0.4190      0.762      0.550      0.583      -1.075       1.913
County_Marin County              0.6331      0.962      0.658      0.510      -1.253       2.519
County_Monterey County           0.6314      0.626      1.009      0.313      -0.595       1.858
County_Orange County             0.2520      0.572      0.441      0.660      -0.869       1.373
County_Other County              0.4415      0.507      0.871      0.384      -0.553       1.436
County_Riverside County          1.5687      0.819      1.916      0.055      -0.036       3.173
County_Sacramento County         0.2812      0.628      0.448      0.654      -0.950       1.512
County_San Bernardino County    -0.3703      0.854     -0.434      0.665      -2.044       1.303
County_San Diego County          0.3633      0.591      0.615      0.539      -0.795       1.522
County_San Francisco County      0.3970      0.522      0.761      0.447      -0.625       1.419
County_San Mateo County         -1.1638      0.637     -1.826      0.068      -2.413       0.086
County_Santa Barbara County      0.4375      0.675      0.648      0.517      -0.886       1.761
County_Santa Clara County        0.6041      0.386      1.565      0.118      -0.152       1.361
County_Santa Cruz County        -0.1973      0.788     -0.250      0.802      -1.742       1.347
County_Ventura County            0.5789      0.748      0.773      0.439      -0.888       2.046
County_Yolo County               0.0763      0.649      0.118      0.906      -1.196       1.349
================================================================================================

Possibly complete quasi-separation: A fraction 0.13 of observations can be
perfectly predicted. This might indicate that there is complete
quasi-separation. In this case some parameters will not be identified.

Observations

In this dataset, negative coefficient values indicate that as the corresponding attribute values increase, the probability of borrowing a personal loan decreases. Conversely, positive coefficients suggest that an increase in attribute values leads to a higher probability of borrowing. It is essential to address multicollinearity to obtain reliable coefficients and p-values. The p-value measures the significance of variables, with values below 0.05 indicating a statistically significant relationship that meets the 0.05 significance threshold.

Test for Multicollinearity - variation inflation factor¶

In [ ]:
# Convert all columns in X_train to numeric if possible
# If a column cannot be converted, it will be filled with NaN
X_train_numeric = X_train.apply(pd.to_numeric, errors='coerce')

# Drop columns with any non-finite values (NaNs, infinities)
# This is crucial to ensure all data is numeric and finite
X_train_numeric = X_train_numeric.dropna(axis=1)
# Reset index after dropping columns for consistency
X_train_numeric = X_train_numeric.reset_index(drop=True)
# Select only numerical features
numerical_features = X_train_numeric.select_dtypes(include=np.number).columns
X_train_numeric = X_train_numeric[numerical_features]

# Calculate VIF
from statsmodels.stats.outliers_influence import variance_inflation_factor

vif_series = pd.Series(
    [variance_inflation_factor(X_train_numeric.values, i) for i in range(X_train_numeric.shape[1])],
    index=X_train_numeric.columns,
    dtype=float,
).sort_values(ascending=False)

print("Series after feature selection : \n\n{}\n".format(vif_series))
Series after feature selection : 

const                 3256.559374
Experience              84.036119
Age                     84.032422
Income                   1.731848
CCAvg                    1.678973
CD_Account               1.315546
Securities_Account       1.141553
CreditCard               1.103630
Online                   1.042650
Family                   1.038721
ZIPCode                  1.005523
ID                       1.002107
dtype: float64

The VIF score of both age and experience are same. This is indicating that these variables are collinear.

Removing Multicollinearity¶

In [ ]:
#removing age
# Create a copy of X_train to avoid modifying the original DataFrame
X_train1 = X_train.copy()

#check if 'Age' column exists before dropping
if 'Age' in X_train1.columns:
# Drop the 'Age' column from the copy

    X_train1 = X_train.drop("Age", axis=1)
else:
    print("'Age' column not found in X_train1.")

# Apply the same numeric preprocessing to X_train1 as you did for X_train
# Convert all columns in X_train1 to numeric if possible
X_train_numeric = X_train1.apply(pd.to_numeric, errors='coerce')

# Drop columns with any non-finite values (NaNs, infinities)
X_train_numeric = X_train_numeric.dropna(axis=1)
X_train_numeric = X_train_numeric.reset_index(drop=True)

# Select only numerical features
numerical_features = X_train_numeric.select_dtypes(include=np.number).columns
X_train_numeric = X_train_numeric[numerical_features]


# Calculate VIF for the modified DataFrames
from statsmodels.stats.outliers_influence import variance_inflation_factor

vif_series_1 = pd.Series(
    [variance_inflation_factor(X_train_numeric.values, i) for i in range(X_train_numeric.shape[1])],
    index=X_train_numeric.columns,
    dtype=float,
).sort_values(ascending=False)

print("Series after feature selection : \n\n{}\n".format(vif_series_1)) # Print vif_series_1 here
Series after feature selection : 

const                 2812.921597
Income                   1.717686
CCAvg                    1.674714
CD_Account               1.315475
Securities_Account       1.141551
CreditCard               1.103354
Online                   1.042649
Family                   1.037715
Experience               1.008569
ZIPCode                  1.005307
ID                       1.002055
dtype: float64

Now, we can proceed with model building since all the variables have vif<5.

In [ ]:
import pandas as pd
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
logistic_regression_1 = sm.Logit(y_train, X_train.astype(float))
lg1 = logistic_regression_1.fit(disp=False)
print(lg1.summary())
                           Logit Regression Results                           
==============================================================================
Dep. Variable:          Personal_Loan   No. Observations:                 4000
Model:                          Logit   Df Residuals:                     3968
Method:                           MLE   Df Model:                           31
Date:                Mon, 30 Sep 2024   Pseudo R-squ.:                  0.6400
Time:                        04:16:56   Log-Likelihood:                -447.99
converged:                       True   LL-Null:                       -1244.5
Covariance Type:            nonrobust   LLR p-value:                     0.000
================================================================================================
                                   coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------------------------
const                           -9.5914     16.781     -0.572      0.568     -42.481      23.298
ID                           -6.751e-05   6.26e-05     -1.079      0.281      -0.000    5.51e-05
Age                             -0.0269      0.072     -0.373      0.709      -0.168       0.115
Experience                       0.0390      0.072      0.542      0.588      -0.102       0.180
Income                           0.0617      0.004     17.574      0.000       0.055       0.069
ZIPCode                      -4.284e-05      0.000     -0.242      0.809      -0.000       0.000
Family                           0.7336      0.090      8.122      0.000       0.557       0.911
CCAvg                            0.3777      0.067      5.646      0.000       0.247       0.509
Securities_Account              -0.8681      0.345     -2.515      0.012      -1.545      -0.192
CD_Account                       3.9798      0.398      9.997      0.000       3.200       4.760
Online                          -0.7646      0.193     -3.969      0.000      -1.142      -0.387
CreditCard                      -1.2444      0.248     -5.022      0.000      -1.730      -0.759
Education_2                      3.8901      0.308     12.611      0.000       3.286       4.495
Education_3                      3.9648      0.302     13.144      0.000       3.374       4.556
County_Contra Costa County       0.4712      0.761      0.619      0.536      -1.021       1.963
County_Kern County               1.0435      0.777      1.343      0.179      -0.480       2.567
County_Los Angeles County        0.4190      0.762      0.550      0.583      -1.075       1.913
County_Marin County              0.6331      0.962      0.658      0.510      -1.253       2.519
County_Monterey County           0.6314      0.626      1.009      0.313      -0.595       1.858
County_Orange County             0.2520      0.572      0.441      0.660      -0.869       1.373
County_Other County              0.4415      0.507      0.871      0.384      -0.553       1.436
County_Riverside County          1.5687      0.819      1.916      0.055      -0.036       3.173
County_Sacramento County         0.2812      0.628      0.448      0.654      -0.950       1.512
County_San Bernardino County    -0.3703      0.854     -0.434      0.665      -2.044       1.303
County_San Diego County          0.3633      0.591      0.615      0.539      -0.795       1.522
County_San Francisco County      0.3970      0.522      0.761      0.447      -0.625       1.419
County_San Mateo County         -1.1638      0.637     -1.826      0.068      -2.413       0.086
County_Santa Barbara County      0.4375      0.675      0.648      0.517      -0.886       1.761
County_Santa Clara County        0.6041      0.386      1.565      0.118      -0.152       1.361
County_Santa Cruz County        -0.1973      0.788     -0.250      0.802      -1.742       1.347
County_Ventura County            0.5789      0.748      0.773      0.439      -0.888       2.046
County_Yolo County               0.0763      0.649      0.118      0.906      -1.196       1.349
================================================================================================

Possibly complete quasi-separation: A fraction 0.13 of observations can be
perfectly predicted. This might indicate that there is complete
quasi-separation. In this case some parameters will not be identified.

Removing variables with P-value>0.05

In [ ]:
import pandas as pd
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Assuming X_train is your original DataFrame
X_train2 = X_train.copy() # Create a copy of X_train and assign it to X_train2

# Replace infinite values with NaN
X_train2 = X_train2.replace([np.inf, -np.inf], np.nan)
In [ ]:
#Removing Experience
X_train2 = X_train1.drop("Experience", axis=1).replace([np.inf, -np.inf], np.nan)
# Drop rows with any missing values
X_train2 = X_train2.dropna()
# Ensure y_train has the same index as X_train2 after dropping rows
y_train2 = y_train[y_train.index.isin(X_train2.index)]


logistic_regression_2 = sm.Logit(y_train, X_train.astype(float))
lg2 = logistic_regression_1.fit(disp=False)
print(lg2.summary())
                           Logit Regression Results                           
==============================================================================
Dep. Variable:          Personal_Loan   No. Observations:                 4000
Model:                          Logit   Df Residuals:                     3968
Method:                           MLE   Df Model:                           31
Date:                Mon, 30 Sep 2024   Pseudo R-squ.:                  0.6400
Time:                        04:17:02   Log-Likelihood:                -447.99
converged:                       True   LL-Null:                       -1244.5
Covariance Type:            nonrobust   LLR p-value:                     0.000
================================================================================================
                                   coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------------------------
const                           -9.5914     16.781     -0.572      0.568     -42.481      23.298
ID                           -6.751e-05   6.26e-05     -1.079      0.281      -0.000    5.51e-05
Age                             -0.0269      0.072     -0.373      0.709      -0.168       0.115
Experience                       0.0390      0.072      0.542      0.588      -0.102       0.180
Income                           0.0617      0.004     17.574      0.000       0.055       0.069
ZIPCode                      -4.284e-05      0.000     -0.242      0.809      -0.000       0.000
Family                           0.7336      0.090      8.122      0.000       0.557       0.911
CCAvg                            0.3777      0.067      5.646      0.000       0.247       0.509
Securities_Account              -0.8681      0.345     -2.515      0.012      -1.545      -0.192
CD_Account                       3.9798      0.398      9.997      0.000       3.200       4.760
Online                          -0.7646      0.193     -3.969      0.000      -1.142      -0.387
CreditCard                      -1.2444      0.248     -5.022      0.000      -1.730      -0.759
Education_2                      3.8901      0.308     12.611      0.000       3.286       4.495
Education_3                      3.9648      0.302     13.144      0.000       3.374       4.556
County_Contra Costa County       0.4712      0.761      0.619      0.536      -1.021       1.963
County_Kern County               1.0435      0.777      1.343      0.179      -0.480       2.567
County_Los Angeles County        0.4190      0.762      0.550      0.583      -1.075       1.913
County_Marin County              0.6331      0.962      0.658      0.510      -1.253       2.519
County_Monterey County           0.6314      0.626      1.009      0.313      -0.595       1.858
County_Orange County             0.2520      0.572      0.441      0.660      -0.869       1.373
County_Other County              0.4415      0.507      0.871      0.384      -0.553       1.436
County_Riverside County          1.5687      0.819      1.916      0.055      -0.036       3.173
County_Sacramento County         0.2812      0.628      0.448      0.654      -0.950       1.512
County_San Bernardino County    -0.3703      0.854     -0.434      0.665      -2.044       1.303
County_San Diego County          0.3633      0.591      0.615      0.539      -0.795       1.522
County_San Francisco County      0.3970      0.522      0.761      0.447      -0.625       1.419
County_San Mateo County         -1.1638      0.637     -1.826      0.068      -2.413       0.086
County_Santa Barbara County      0.4375      0.675      0.648      0.517      -0.886       1.761
County_Santa Clara County        0.6041      0.386      1.565      0.118      -0.152       1.361
County_Santa Cruz County        -0.1973      0.788     -0.250      0.802      -1.742       1.347
County_Ventura County            0.5789      0.748      0.773      0.439      -0.888       2.046
County_Yolo County               0.0763      0.649      0.118      0.906      -1.196       1.349
================================================================================================

Possibly complete quasi-separation: A fraction 0.13 of observations can be
perfectly predicted. This might indicate that there is complete
quasi-separation. In this case some parameters will not be identified.
In [ ]:
import statsmodels.api as sm

# Initial list of columns
cols = X_train2.columns.tolist()

# Set an initial max p-value
max_p_value = 1

# Loop to drop features with high p-values
while len(cols) > 0:
    # Define the training set with the current columns
    X_train_aux = X_train2[cols]
    X_train_aux = sm.add_constant(X_train_aux)  # Add constant for intercept

    # Fit the logistic regression model
    model = sm.Logit(y_train2, X_train_aux.astype(float)).fit(disp=False)

    # Get p-values and the maximum p-value
    p_values = model.pvalues
    max_p_value = max(p_values)

    # Name of the variable with maximum p-value
    feature_with_p_max = p_values.idxmax()

    # Check if max p-value is greater than 0.05
    if max_p_value > 0.05:
        cols.remove(feature_with_p_max)  # Drop the feature
    else:
        break  # Exit loop if all features are significant

# Selected features after dropping
selected_features = cols
print(selected_features)
['const', 'Income', 'Family', 'CCAvg', 'Securities_Account', 'CD_Account', 'Online', 'CreditCard', 'Education_2', 'Education_3', 'County_San Mateo County']
In [ ]:
import statsmodels.api as sm
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np
from statsmodels.stats.outliers_influence import variance_inflation_factor


X_train3 = X_train2[selected_features]

logistic_regression_3 = sm.Logit(y_train, X_train3.astype(float))
lg3 = logistic_regression_3.fit()

print(lg3.summary())
Optimization terminated successfully.
         Current function value: 0.113747
         Iterations 9
                           Logit Regression Results                           
==============================================================================
Dep. Variable:          Personal_Loan   No. Observations:                 4000
Model:                          Logit   Df Residuals:                     3989
Method:                           MLE   Df Model:                           10
Date:                Mon, 30 Sep 2024   Pseudo R-squ.:                  0.6344
Time:                        04:17:12   Log-Likelihood:                -454.99
converged:                       True   LL-Null:                       -1244.5
Covariance Type:            nonrobust   LLR p-value:                     0.000
===========================================================================================
                              coef    std err          z      P>|z|      [0.025      0.975]
-------------------------------------------------------------------------------------------
const                     -13.5391      0.672    -20.144      0.000     -14.856     -12.222
Income                      0.0610      0.003     17.920      0.000       0.054       0.068
Family                      0.7071      0.089      7.990      0.000       0.534       0.881
CCAvg                       0.3526      0.065      5.446      0.000       0.226       0.479
Securities_Account         -0.8534      0.335     -2.548      0.011      -1.510      -0.197
CD_Account                  3.9343      0.390     10.081      0.000       3.169       4.699
Online                     -0.7750      0.189     -4.110      0.000      -1.145      -0.405
CreditCard                 -1.2354      0.244     -5.071      0.000      -1.713      -0.758
Education_2                 3.8187      0.301     12.667      0.000       3.228       4.410
Education_3                 3.9173      0.292     13.427      0.000       3.345       4.489
County_San Mateo County    -1.6013      0.564     -2.838      0.005      -2.707      -0.495
===========================================================================================

Possibly complete quasi-separation: A fraction 0.12 of observations can be
perfectly predicted. This might indicate that there is complete
quasi-separation. In this case some parameters will not be identified.

Observations¶

  1. No feature now have a p-value greater than 0.5.therefore,the features in X_train3 can be considered as the final ones and lg3 as the final model.

  2. The coefficients of all the selected features are positive. Therefore, an increase in these will lead to an increase in the probability of a customer accepting the personal loan.

Converting coefficients to odds¶

In [ ]:
#converting coeffficients to odds as they are in terms of log(odd)
odds = np.exp(lg3.params)
odds

#finding the percentage change
percentage_change = odds * 100 - 100
percentage_change

#adding the odds to a dataframe
odds_df = pd.DataFrame(odds, columns=["Odds"])
odds_df["Percentage Change"] = percentage_change
odds_df
Out[ ]:
Odds Percentage Change
const 0.000001 -99.999868
Income 1.062851 6.285052
Family 2.028157 102.815750
CCAvg 1.422737 42.273701
Securities_Account 0.425946 -57.405445
CD_Account 51.127910 5012.791026
Online 0.460703 -53.929715
CreditCard 0.290704 -70.929572
Education_2 45.544678 4454.467760
Education_3 50.263304 4926.330357
County_San Mateo County 0.201633 -79.836729

Coefficient interpretations¶

  • Features such as income,fanily,cc_avg, CD_account, Riversie county will increase the chance of taking personal loan
  • Features such as securities_account,online,creditcard,San Mateo county will decrease the chance to take personal loan.

Model Performance Improvement¶

In [ ]:
 #defining a function to compute different metrics to check performance of a classification model built using sklearn
def model_performance_classification(model, predictors, target):
    """
    Function to compute different metrics to check classification model performance

    model: classifier
    predictors: independent variables
    target: dependent variable
    """

    # predicting using the independent variables
    pred = model.predict(predictors)

    acc = accuracy_score(target, pred)  # to compute Accuracy
    recall = recall_score(target, pred)  # to compute Recall
    precision = precision_score(target, pred)  # to compute Precision
    f1 = f1_score(target, pred)  # to compute F1-score

    # creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {"Accuracy": acc, "Recall": recall, "Precision": precision, "F1": f1,},
        index=[0],
    )

    return df_perf
In [ ]:
def confusion_matrix_sklearn(model, predictors, target):
    """
    To plot the confusion_matrix with percentages

    model: classifier
    predictors: independent variables
    target: dependent variable
    """
    y_pred = model.predict(predictors)
    cm = confusion_matrix(target, y_pred)
    labels = np.asarray(
        [
            ["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
            for item in cm.flatten()
        ]
    )

    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=labels, fmt="")
    plt.ylabel("True label")
    plt.xlabel("Predicted label")
In [ ]:
#Model0 performance on test set
model0 = DecisionTreeClassifier(random_state=1)
if (y == 1).any():  # Check if any value in y is equal to 1
    model0 = DecisionTreeClassifier(random_state=1, class_weight="balanced")
else:
    model0 = DecisionTreeClassifier(random_state=1)
    X = df.drop(columns=['id', 'personal_loan'])
    y = df['personal_loan']
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
    X_train = X_train[~y_train.isnull()] # Select rows in X_train where y_train is not null
    y_train = y_train.dropna()
    model0 = DecisionTreeClassifier(random_state=1)
    model0.fit(X_train, y_train)
    model0

    #convert categorical data to numerical data in a dummy variable
    X=pd.get_dummies(X, columns=X.filter(items=["age", "experience", "income", "cc_avg", "zipcode", "family", "mortgage", "education"]).columns.tolist(), drop_first=True)
    X.head()
In [ ]:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
import pandas as pd
model0 = DecisionTreeClassifier(random_state=1)
model0.fit(X_train, y_train)
print(confusion_matrix(y_train, model0.predict(X_train)))
print(confusion_matrix(y_test, model0.predict(X_test)))
[[3625    0]
 [   0  375]]
[[887   8]
 [  7  98]]
In [ ]:
decision_tree_classifier = model0
decision_tree_classifier

model0, X_train, y_train = decision_tree_classifier, X_train, y_train
decision_tree_perf_train = model_performance_classification(
    model0, X_train, y_train
)
decision_tree_perf_train
Out[ ]:
Accuracy Recall Precision F1
0 1.0 1.0 1.0 1.0
In [ ]:
#Recall treatment
log_reg_model_test_perf = model_performance_classification(
    model0, X_test, y_test
)
log_reg_model_test_perf
Out[ ]:
Accuracy Recall Precision F1
0 0.985 0.933333 0.924528 0.92891
In [ ]:
def confusion_matrix_sklearn(model, predictors, target):
    """
    To plot the confusion_matrix with percentages

    model: classifier
    predictors: independent variables
    target: dependent variable
    """
    from sklearn.metrics import confusion_matrix
    import numpy as np
    import matplotlib.pyplot as plt
    import seaborn as sns
    import pandas as pd

predictors = X_test
target = y_test
model = model0

y_pred = model.predict(predictors)
cm = confusion_matrix(target, y_pred)
# Calculate percentages
percentages = ["{0:.2%}".format(item/cm.sum()) for item in cm.flatten()]
# Create labels array with same shape as cm
labels = np.asarray(percentages).reshape(cm.shape)

plt.figure(figsize=(6, 4))
sns.heatmap(cm, annot=labels, fmt="")
plt.ylabel("True label")
plt.xlabel("Predicted label")
plt.show()
y_test = model0.predict(X_test)
print('Y_test.shape')
print(y_test.shape)
print(y_test)

confusion_matrix_sklearn(model0, X_test, y_test)
No description has been provided for this image
Y_test.shape
(1000,)
[0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0
 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0
 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0
 0 0 1 0 0 1 0 0 0 0 0 1 1 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0
 0 1 0 1 0 0 0 0 1 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0
 1 0 0 0 0 0 0 1 0 0 0 1 0 1 0 0 0 0 0 0 0 1 0 0 1 1 1 0 0 0 0 0 1 0 0 0 0
 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
 0 0 1 0 0 0 0 0 0 0 1 1 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 1 0 0 0
 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0
 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 1 0 0 0 0 1 0
 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0
 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 1
 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 1 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 1 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0]
In [ ]:
model1 = DecisionTreeClassifier(random_state=42, class_weight="balanced")
model1.fit(X_train, y_train)
Out[ ]:
DecisionTreeClassifier(class_weight='balanced', random_state=42)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
DecisionTreeClassifier(class_weight='balanced', random_state=42)
In [ ]:
model1 = DecisionTreeClassifier(random_state=42, class_weight="balanced")
model1.fit(X_train, y_train)
#calculate the pruning path
from sklearn.tree import _tree
cost_complexity_pruning_path = model1.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas, impurities = cost_complexity_pruning_path.ccp_alphas, cost_complexity_pruning_path.impurities
ccp_alphas, impurities
Out[ ]:
(array([0.00000000e+00, 4.90029473e-19, 2.78704263e-18, 1.06581410e-17,
        2.62012634e-17, 2.06057393e-16, 1.35135135e-04, 1.37221269e-04,
        1.37845646e-04, 2.50000000e-04, 2.53995931e-04, 2.56091546e-04,
        2.56880734e-04, 2.62295082e-04, 2.65842784e-04, 2.70270270e-04,
        2.75113122e-04, 3.34314376e-04, 4.59443075e-04, 5.35885167e-04,
        6.37304276e-04, 7.15822856e-04, 7.50000000e-04, 7.65231298e-04,
        8.33595329e-04, 8.38147199e-04, 1.06035930e-03, 1.15076259e-03,
        1.17711725e-03, 1.18558824e-03, 1.37136825e-03, 1.46056543e-03,
        1.68674699e-03, 1.79020979e-03, 1.82608696e-03, 2.61000000e-03,
        2.75098565e-03, 2.85978730e-03, 3.36153030e-03, 3.61896386e-03,
        4.01666137e-03, 4.89426453e-03, 4.91632511e-03, 3.09771894e-02,
        3.83996398e-02, 2.79838964e-01]),
 array([-7.89948949e-15, -7.89899946e-15, -7.89621242e-15, -7.88555428e-15,
        -7.85935302e-15, -7.65329562e-15,  2.70270270e-04,  5.44712809e-04,
         8.20404101e-04,  1.07040410e-03,  1.83239189e-03,  3.36894117e-03,
         4.65334484e-03,  4.91563992e-03,  6.51069662e-03,  6.78096689e-03,
         7.33119314e-03,  1.00057081e-02,  1.13840374e-02,  1.19199225e-02,
         1.25572268e-02,  1.39888725e-02,  1.47388725e-02,  1.55041038e-02,
         1.63376992e-02,  1.80139936e-02,  2.01347122e-02,  2.12854747e-02,
         2.24625920e-02,  2.48337685e-02,  2.62051367e-02,  2.91262676e-02,
         3.08130146e-02,  3.26032244e-02,  3.44293113e-02,  3.70393113e-02,
         4.25412826e-02,  4.82608572e-02,  5.16223875e-02,  5.52413514e-02,
         5.92580127e-02,  6.41522773e-02,  7.39849275e-02,  1.04962117e-01,
         2.20161036e-01,  5.00000000e-01]))
In [ ]:
#convert numerical data to categorical data
X=pd.get_dummies(X, columns=X.filter(items=["age", "experience", "income", "cc_avg", "zipcode", "family", "mortgage", "education"]).columns.tolist(), drop_first=True)
confusion_matrix_sklearn(model1, X_train, y_train)
confusion_matrix_sklearn(model1, X_test, y_test)
In [ ]:
decision_tree_perf_train = model_performance_classification(
    model1, X_train, y_train
)
decision_tree_perf_train
Out[ ]:
Accuracy Recall Precision F1
0 1.0 1.0 1.0 1.0
In [ ]:
confusion_matrix_sklearn(model1, X_test, y_test)

Observations

  1. recall of model has increased but the other metric have reduced.
  2. The model is still performing well.
In [ ]:
decision_tree_perf_test = model_performance_classification(
    model1, X_test, y_test
)
decision_tree_perf_test
Out[ ]:
Accuracy Recall Precision F1
0 0.982 0.877358 0.94898 0.911765
In [ ]:
#Recall treatment
log_reg_model_test_perf = model_performance_classification(
    model1, X_test, y_test
)
log_reg_model_test_perf
Out[ ]:
Accuracy Recall Precision F1
0 0.982 0.877358 0.94898 0.911765
In [ ]:
# Define the parameters of the tree to iterate over
max_depth_values = np.arange(2, 7, 2)
max_leaf_nodes_values = [50, 75, 150, 250]
min_samples_split_values = [10, 30, 50, 70]

# Initialize variables to store the best model and its performance
best_estimator = None
best_score_diff = float('inf')
best_test_score = 0.0

# Iterate over all combinations of the specified parameter values
for max_depth in max_depth_values:
    for max_leaf_nodes in max_leaf_nodes_values:
        for min_samples_split in min_samples_split_values:
            # Create a decision tree classifier with the current parameter values
            estimator = DecisionTreeClassifier(
                max_depth=max_depth,
                max_leaf_nodes=max_leaf_nodes,
                min_samples_split=min_samples_split,
                class_weight='balanced',
                random_state=42
            )

            # Fit the model to the training data
            estimator.fit(X_train, y_train)

            # Make predictions on the training and test sets
            y_train_pred = estimator.predict(X_train)
            y_test_pred = estimator.predict(X_test)
In [ ]:
# creating an instance of the best model
model2 = DecisionTreeClassifier(
    max_depth=max_depth,
    max_leaf_nodes=max_leaf_nodes,
    min_samples_split=min_samples_split,
    class_weight='balanced',
    random_state=42
)
model2 = best_estimator

# fitting the best model to the training data
from sklearn.tree import DecisionTreeClassifier
model2 = DecisionTreeClassifier(random_state=42)
model2.fit(X_train, y_train)
model2  # printing the best model
Out[ ]:
DecisionTreeClassifier(random_state=42)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
DecisionTreeClassifier(random_state=42)
In [ ]:
decision_tree_tune_perf_test = model_performance_classification(
    model2, X_test, y_test
)
decision_tree_tune_perf_test
Out[ ]:
Accuracy Recall Precision F1
0 0.997 0.981132 0.990476 0.985782
In [ ]:
decision_tree_tune_perf_train = model_performance_classification(
    model2, X_train, y_train
)
decision_tree_tune_perf_train
Out[ ]:
Accuracy Recall Precision F1
0 1.0 1.0 1.0 1.0
In [ ]:
confusion_matrix_sklearn(model2, X_test, y_test)

Model performance on training set

In [ ]:
decision_tree_tune_perf_test = model_performance_classification(
    model2, X_test, y_test
)
decision_tree_tune_perf_test
Out[ ]:
Accuracy Recall Precision F1
0 0.997 0.981132 0.990476 0.985782

Model performance on test set

In [ ]:
# recall treatment
log_reg_model_test_perf = model_performance_classification(
    model2, X_test, y_test
)
log_reg_model_test_perf
Out[ ]:
Accuracy Recall Precision F1
0 0.997 0.981132 0.990476 0.985782

conclusion

  1. The values are comparable across all three logistic regression models.

Model Building - decision Tree¶

In [ ]:
#Viewing the dataset data_dtree that was created prior to the outlier treatment

df_dtree.head()
Out[ ]:
ID Age Experience Income ZIPCode Family CCAvg Education Mortgage Personal_Loan Securities_Account CD_Account Online CreditCard County
0 1 25 1 49 91107 4 1.6 1 No 0 1 0 0 0 Los Angeles County
1 2 45 19 34 90089 3 1.5 1 No 0 1 0 0 0 Los Angeles County
2 3 39 15 11 94720 1 1.0 1 No 0 0 0 0 0 Alameda County
3 4 35 9 100 94112 1 2.7 2 No 0 0 0 0 0 San Francisco County
4 5 35 8 45 91330 4 1.0 2 No 0 0 0 0 1 Los Angeles County
In [ ]:
X = df_dtree.drop(['Personal_Loan'], axis=1) #Creating independent variables
y = df_dtree["Personal_Loan"]
#Creating training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    df_dtree.drop("Personal_Loan", axis=1),
    df_dtree["Personal_Loan"],
    test_size=0.20,
    random_state=42,
)

#creating dummy variables for categorical features
X_train = pd.get_dummies(X_train)
X_test = pd.get_dummies(X_test)
In [ ]:
print("Shape of Training set : ", X_train.shape)
print("Shape of test set : ", X_test.shape)
print("\n Percentage of classes in training set : ")
print(y_train.value_counts(normalize=True))
print("\n Percentage of classes in test set : ")
print(y_test.value_counts(normalize=True))
Shape of Training set :  (4000, 33)
Shape of test set :  (1000, 33)

 Percentage of classes in training set : 
Personal_Loan
0    0.90625
1    0.09375
Name: proportion, dtype: float64

 Percentage of classes in test set : 
Personal_Loan
0    0.895
1    0.105
Name: proportion, dtype: float64

We have well balanced classes of thh depedent variable in both train and test sets.

Model evaluation criterion¶

Model can make wrong prediction as : Predicting a customer will borrow the personal loan but in reality the customer does not borrow. Predicting a customer will not borrow the personal loan but in reality the customer will borrow the laon.

Prediction of concern : The second prediction is our major concern as the AllLife Bank is looking to expand it's loan business, every single customer who will borrow the personal loan is very important. Hence, mistakes in the second prediction (i.e. False Negatives) has to be considerably low How to reduce False Negatives? By recalling score should be maximized. Greater the Recall score higher the chances of predicting the potential customers who will borrow a personal loan

In [ ]:
def model_performance_classification(model, predictors, target):
    """
    Function to compute different metrics to check classification model performance

    model: classifier
    predictors: independent variables
    target: dependent variable
    """

    # predicting using the independent variables
    pred = model.predict(predictors)

    acc = accuracy_score(target, pred)  # to compute Accuracy
    recall = recall_score(target, pred)  # to compute Recall
    precision = precision_score(target, pred)  # to compute Precision
    f1 = f1_score(target, pred)  # to compute F1-score

    # creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {"Accuracy": acc, "Recall": recall, "Precision": precision, "F1": f1,},
        index=[0],
    )

    return df_perf
In [ ]:
#defining a function to plot the oncfusion_matrix of a classification model

def model_performance_classification(model, predictors, target):
    """
    Function to compute different metrics to check classification model performance

    model: classifier
    predictors: independent variables
    target: dependent variable
    """

    # predicting using the independent variables
    pred = model.predict(predictors)

    acc = accuracy_score(target, pred)  # to compute Accuracy
    recall = recall_score(target, pred)  # to compute Recall
    precision = precision_score(target, pred)  # to compute Precision
    f1 = f1_score(target, pred)  # to compute F1-score

    # creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {"Accuracy": acc, "Recall": recall, "Precision": precision, "F1": f1,},
        index=[0],
    )

    return df_perf
In [ ]:
# Building a Decision Tree using default 'gini' criteria to split

model = DecisionTreeClassifier(
    criterion="gini", class_weight={0: 0.15, 1: 0.85}, random_state=42
)
model.fit(X_train, y_train)

 # defining higher weightage to 1 in the dependent variable
Out[ ]:
DecisionTreeClassifier(class_weight={0: 0.15, 1: 0.85}, random_state=42)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
DecisionTreeClassifier(class_weight={0: 0.15, 1: 0.85}, random_state=42)
In [ ]:
# Fitting the model on the training set

model.fit(X_train, y_train)
Out[ ]:
DecisionTreeClassifier(class_weight={0: 0.15, 1: 0.85}, random_state=42)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
DecisionTreeClassifier(class_weight={0: 0.15, 1: 0.85}, random_state=42)

Model Performance on training set

In [ ]:
confusion_matrix(y_train, model.predict(X_train))
import statsmodels.api as sm
def confision_matrix_sklearn(model, predictors, target):
    """
    To plot the confusion_matrix with percentages

    model: classifier
    predictors: independent variables
    target: dependent variable
    """
    y_pred = model.predict(predictors)
    cm = confusion_matrix(target, y_pred)
    labels = np.asarray(
        [
            ["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
            for item in cm.flatten()
        ]
    )
    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=labels, fmt="")
    plt.ylabel("True label")
    plt.xlabel("Predicted label")
    plt.show()
    return
    confision_matrix_sklearn(model, X_train, y_train)
In [ ]:
decision_tree_perf_train = model_performance_classification(model, X_train, y_train)
decision_tree_perf_train
Out[ ]:
Accuracy Recall Precision F1
0 1.0 1.0 1.0 1.0

Observation The deicison tree is fully grown.Tehrefore the model is overfit and it is able to classify all the data poits on the training set with no errors.

Model Performance on test set

In [ ]:
def confusion_matrix_sklearn(model, predictors, target):
    """
    To plot the confusion_matrix with percentages

    model: classifier
    predictors: independent variables
    target: dependent variable
    """
    y_pred = model.predict(predictors)
    cm = confusion_matrix(target, y_pred)
    labels = np.asarray(
        [
            ["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
            for item in cm.flatten()
        ]
    )
    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=labels, fmt="")
    plt.ylabel("True label")
    plt.xlabel("Predicted label")
    plt.show()
    return
    confision
    matrix_sklearn(model, X_test, y_test)
    return
In [ ]:
decision_tree_performance_test = model_performance_classification(model, X_test, y_test) # Replace y_est with y_test
decision_tree_performance_test
Out[ ]:
Accuracy Recall Precision F1
0 0.984 0.92381 0.92381 0.92381

Found disparity in model performance is due to overfitting

In [ ]:
plt.figure(figsize=(12, 8))

# Assuming X_train is a pandas DataFrame containing your training data
# and you want to use its column names as feature names
feature_names = X_train.columns  # Define feature_names here

out = tree.plot_tree(
    model,
    feature_names=feature_names,  # Now feature_names is defined
    filled=True,
    fontsize=9,
    node_ids=False,
    class_names=None,
)


# below code will add arrows to the decision tree split if they are missing
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor("black")
        arrow.set_linewidth(1)
plt.show()
No description has been provided for this image
In [ ]:
# Text report showing the rules of a decision tree

print(tree.export_text(model, feature_names=feature_names, show_weights=True))
|--- Income <= 98.50
|   |--- CCAvg <= 2.95
|   |   |--- weights: [424.80, 0.00] class: 0
|   |--- CCAvg >  2.95
|   |   |--- CCAvg <= 4.25
|   |   |   |--- Income <= 82.50
|   |   |   |   |--- Experience <= 8.50
|   |   |   |   |   |--- Family <= 3.50
|   |   |   |   |   |   |--- weights: [0.00, 4.25] class: 1
|   |   |   |   |   |--- Family >  3.50
|   |   |   |   |   |   |--- weights: [1.05, 0.00] class: 0
|   |   |   |   |--- Experience >  8.50
|   |   |   |   |   |--- ZIPCode <= 94657.50
|   |   |   |   |   |   |--- ZIPCode <= 91257.00
|   |   |   |   |   |   |   |--- ID <= 1184.50
|   |   |   |   |   |   |   |   |--- weights: [0.00, 0.85] class: 1
|   |   |   |   |   |   |   |--- ID >  1184.50
|   |   |   |   |   |   |   |   |--- weights: [1.20, 0.00] class: 0
|   |   |   |   |   |   |--- ZIPCode >  91257.00
|   |   |   |   |   |   |   |--- weights: [6.75, 0.00] class: 0
|   |   |   |   |   |--- ZIPCode >  94657.50
|   |   |   |   |   |   |--- Income <= 63.50
|   |   |   |   |   |   |   |--- weights: [0.90, 0.00] class: 0
|   |   |   |   |   |   |--- Income >  63.50
|   |   |   |   |   |   |   |--- Family <= 2.50
|   |   |   |   |   |   |   |   |--- weights: [0.45, 0.00] class: 0
|   |   |   |   |   |   |   |--- Family >  2.50
|   |   |   |   |   |   |   |   |--- Age <= 54.50
|   |   |   |   |   |   |   |   |   |--- County_Alameda County <= 0.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 0.85] class: 1
|   |   |   |   |   |   |   |   |   |--- County_Alameda County >  0.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 0.85] class: 1
|   |   |   |   |   |   |   |   |--- Age >  54.50
|   |   |   |   |   |   |   |   |   |--- weights: [0.15, 0.00] class: 0
|   |   |   |--- Income >  82.50
|   |   |   |   |--- ID <= 934.50
|   |   |   |   |   |--- County_Santa Clara County <= 0.50
|   |   |   |   |   |   |--- weights: [1.65, 0.00] class: 0
|   |   |   |   |   |--- County_Santa Clara County >  0.50
|   |   |   |   |   |   |--- weights: [0.00, 0.85] class: 1
|   |   |   |   |--- ID >  934.50
|   |   |   |   |   |--- County_Santa Clara County <= 0.50
|   |   |   |   |   |   |--- County_Contra Costa County <= 0.50
|   |   |   |   |   |   |   |--- County_Yolo County <= 0.50
|   |   |   |   |   |   |   |   |--- County_Orange County <= 0.50
|   |   |   |   |   |   |   |   |   |--- County_Santa Barbara County <= 0.50
|   |   |   |   |   |   |   |   |   |   |--- County_Marin County <= 0.50
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 8
|   |   |   |   |   |   |   |   |   |   |--- County_Marin County >  0.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [0.15, 0.00] class: 0
|   |   |   |   |   |   |   |   |   |--- County_Santa Barbara County >  0.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [0.15, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- County_Orange County >  0.50
|   |   |   |   |   |   |   |   |   |--- weights: [0.30, 0.00] class: 0
|   |   |   |   |   |   |   |--- County_Yolo County >  0.50
|   |   |   |   |   |   |   |   |--- weights: [0.30, 0.00] class: 0
|   |   |   |   |   |   |--- County_Contra Costa County >  0.50
|   |   |   |   |   |   |   |--- weights: [0.30, 0.00] class: 0
|   |   |   |   |   |--- County_Santa Clara County >  0.50
|   |   |   |   |   |   |--- weights: [0.45, 0.00] class: 0
|   |   |--- CCAvg >  4.25
|   |   |   |--- weights: [5.85, 0.00] class: 0
|--- Income >  98.50
|   |--- Education <= 1.50
|   |   |--- Family <= 2.50
|   |   |   |--- Income <= 99.50
|   |   |   |   |--- CCAvg <= 4.20
|   |   |   |   |   |--- weights: [0.60, 0.00] class: 0
|   |   |   |   |--- CCAvg >  4.20
|   |   |   |   |   |--- ZIPCode <= 90139.50
|   |   |   |   |   |   |--- weights: [0.00, 0.85] class: 1
|   |   |   |   |   |--- ZIPCode >  90139.50
|   |   |   |   |   |   |--- weights: [0.00, 0.85] class: 1
|   |   |   |--- Income >  99.50
|   |   |   |   |--- Income <= 103.50
|   |   |   |   |   |--- Securities_Account <= 0.50
|   |   |   |   |   |   |--- weights: [2.10, 0.00] class: 0
|   |   |   |   |   |--- Securities_Account >  0.50
|   |   |   |   |   |   |--- weights: [0.00, 0.85] class: 1
|   |   |   |   |--- Income >  103.50
|   |   |   |   |   |--- County_Santa Clara County <= 0.50
|   |   |   |   |   |   |--- weights: [69.45, 0.00] class: 0
|   |   |   |   |   |--- County_Santa Clara County >  0.50
|   |   |   |   |   |   |--- weights: [8.40, 0.00] class: 0
|   |   |--- Family >  2.50
|   |   |   |--- Income <= 113.50
|   |   |   |   |--- Family <= 3.50
|   |   |   |   |   |--- County_Orange County <= 0.50
|   |   |   |   |   |   |--- weights: [2.70, 0.00] class: 0
|   |   |   |   |   |--- County_Orange County >  0.50
|   |   |   |   |   |   |--- weights: [0.00, 0.85] class: 1
|   |   |   |   |--- Family >  3.50
|   |   |   |   |   |--- CCAvg <= 4.15
|   |   |   |   |   |   |--- County_Los Angeles County <= 0.50
|   |   |   |   |   |   |   |--- weights: [0.00, 1.70] class: 1
|   |   |   |   |   |   |--- County_Los Angeles County >  0.50
|   |   |   |   |   |   |   |--- weights: [0.00, 1.70] class: 1
|   |   |   |   |   |--- CCAvg >  4.15
|   |   |   |   |   |   |--- weights: [0.15, 0.00] class: 0
|   |   |   |--- Income >  113.50
|   |   |   |   |--- weights: [0.00, 50.15] class: 1
|   |--- Education >  1.50
|   |   |--- Income <= 114.50
|   |   |   |--- CCAvg <= 2.45
|   |   |   |   |--- Income <= 106.50
|   |   |   |   |   |--- weights: [5.10, 0.00] class: 0
|   |   |   |   |--- Income >  106.50
|   |   |   |   |   |--- Experience <= 31.50
|   |   |   |   |   |   |--- Experience <= 3.50
|   |   |   |   |   |   |   |--- weights: [1.35, 0.00] class: 0
|   |   |   |   |   |   |--- Experience >  3.50
|   |   |   |   |   |   |   |--- County_Santa Clara County <= 0.50
|   |   |   |   |   |   |   |   |--- Online <= 0.50
|   |   |   |   |   |   |   |   |   |--- Family <= 3.00
|   |   |   |   |   |   |   |   |   |   |--- weights: [1.05, 0.00] class: 0
|   |   |   |   |   |   |   |   |   |--- Family >  3.00
|   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 0.85] class: 1
|   |   |   |   |   |   |   |   |--- Online >  0.50
|   |   |   |   |   |   |   |   |   |--- CreditCard <= 0.50
|   |   |   |   |   |   |   |   |   |   |--- ZIPCode <= 95272.00
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 3
|   |   |   |   |   |   |   |   |   |   |--- ZIPCode >  95272.00
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [0.15, 0.00] class: 0
|   |   |   |   |   |   |   |   |   |--- CreditCard >  0.50
|   |   |   |   |   |   |   |   |   |   |--- Experience <= 13.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [0.15, 0.00] class: 0
|   |   |   |   |   |   |   |   |   |   |--- Experience >  13.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [0.30, 0.00] class: 0
|   |   |   |   |   |   |   |--- County_Santa Clara County >  0.50
|   |   |   |   |   |   |   |   |--- Experience <= 5.50
|   |   |   |   |   |   |   |   |   |--- weights: [0.15, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- Experience >  5.50
|   |   |   |   |   |   |   |   |   |--- weights: [0.60, 0.00] class: 0
|   |   |   |   |   |--- Experience >  31.50
|   |   |   |   |   |   |--- weights: [1.80, 0.00] class: 0
|   |   |   |--- CCAvg >  2.45
|   |   |   |   |--- ID <= 2303.00
|   |   |   |   |   |--- ID <= 163.00
|   |   |   |   |   |   |--- weights: [0.15, 0.00] class: 0
|   |   |   |   |   |--- ID >  163.00
|   |   |   |   |   |   |--- County_Other County <= 0.50
|   |   |   |   |   |   |   |--- weights: [0.00, 12.75] class: 1
|   |   |   |   |   |   |--- County_Other County >  0.50
|   |   |   |   |   |   |   |--- weights: [0.15, 0.00] class: 0
|   |   |   |   |--- ID >  2303.00
|   |   |   |   |   |--- Experience <= 35.50
|   |   |   |   |   |   |--- Age <= 34.50
|   |   |   |   |   |   |   |--- Experience <= 2.50
|   |   |   |   |   |   |   |   |--- weights: [0.00, 0.85] class: 1
|   |   |   |   |   |   |   |--- Experience >  2.50
|   |   |   |   |   |   |   |   |--- weights: [0.90, 0.00] class: 0
|   |   |   |   |   |   |--- Age >  34.50
|   |   |   |   |   |   |   |--- County_Santa Clara County <= 0.50
|   |   |   |   |   |   |   |   |--- Income <= 100.00
|   |   |   |   |   |   |   |   |   |--- weights: [0.15, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- Income >  100.00
|   |   |   |   |   |   |   |   |   |--- ZIPCode <= 90154.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [0.15, 0.00] class: 0
|   |   |   |   |   |   |   |   |   |--- ZIPCode >  90154.50
|   |   |   |   |   |   |   |   |   |   |--- CreditCard <= 0.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 5.95] class: 1
|   |   |   |   |   |   |   |   |   |   |--- CreditCard >  0.50
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 2
|   |   |   |   |   |   |   |--- County_Santa Clara County >  0.50
|   |   |   |   |   |   |   |   |--- Experience <= 11.50
|   |   |   |   |   |   |   |   |   |--- weights: [0.15, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- Experience >  11.50
|   |   |   |   |   |   |   |   |   |--- weights: [0.15, 0.00] class: 0
|   |   |   |   |   |--- Experience >  35.50
|   |   |   |   |   |   |--- ZIPCode <= 94275.50
|   |   |   |   |   |   |   |--- weights: [0.15, 0.00] class: 0
|   |   |   |   |   |   |--- ZIPCode >  94275.50
|   |   |   |   |   |   |   |--- weights: [0.45, 0.00] class: 0
|   |   |--- Income >  114.50
|   |   |   |--- Income <= 116.50
|   |   |   |   |--- Mortgage_Yes <= 0.50
|   |   |   |   |   |--- ID <= 304.00
|   |   |   |   |   |   |--- weights: [0.15, 0.00] class: 0
|   |   |   |   |   |--- ID >  304.00
|   |   |   |   |   |   |--- ZIPCode <= 90060.00
|   |   |   |   |   |   |   |--- weights: [0.15, 0.00] class: 0
|   |   |   |   |   |   |--- ZIPCode >  90060.00
|   |   |   |   |   |   |   |--- Experience <= 32.00
|   |   |   |   |   |   |   |   |--- weights: [0.00, 6.80] class: 1
|   |   |   |   |   |   |   |--- Experience >  32.00
|   |   |   |   |   |   |   |   |--- weights: [0.15, 0.00] class: 0
|   |   |   |   |--- Mortgage_Yes >  0.50
|   |   |   |   |   |--- Age <= 32.00
|   |   |   |   |   |   |--- weights: [0.15, 0.00] class: 0
|   |   |   |   |   |--- Age >  32.00
|   |   |   |   |   |   |--- weights: [0.15, 0.00] class: 0
|   |   |   |--- Income >  116.50
|   |   |   |   |--- weights: [0.00, 201.45] class: 1

The tree above is fully grown hence,complex and difficult to interpret.

In [ ]:
#Gini importance of features in the tree building

print(
    pd.DataFrame(
        model.feature_importances_, columns=["Imp"], index=X_train.columns
    ).sort_values(by="Imp", ascending=False)
)
                                       Imp
Income                        5.619161e-01
Family                        1.563385e-01
Education                     1.291092e-01
CCAvg                         8.804473e-02
Experience                    1.766618e-02
ID                            1.029078e-02
County_Santa Clara County     7.854868e-03
ZIPCode                       7.288236e-03
County_Orange County          4.430490e-03
Age                           3.518669e-03
Securities_Account            3.011105e-03
CreditCard                    2.181233e-03
Online                        2.074456e-03
Mortgage_Yes                  1.261141e-03
County_Yolo County            1.177442e-03
County_Contra Costa County    1.143064e-03
County_Other County           7.377706e-04
County_Sacramento County      7.049808e-04
County_Marin County           6.303210e-04
County_Santa Barbara County   6.206978e-04
County_Los Angeles County     2.817669e-18
County_Alameda County         1.408835e-18
County_San Mateo County       0.000000e+00
County_Ventura County         0.000000e+00
County_Santa Cruz County      0.000000e+00
CD_Account                    0.000000e+00
County_San Bernardino County  0.000000e+00
County_San Francisco County   0.000000e+00
County_San Diego County       0.000000e+00
County_Riverside County       0.000000e+00
Mortgage_No                   0.000000e+00
County_Monterey County        0.000000e+00
County_Kern County            0.000000e+00
In [ ]:
importances = model.feature_importances_
indices = np.argsort(importances)

plt.figure(figsize=(12, 10))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
No description has been provided for this image
  1. According to the decision tree model, income if the most important variable for predicting if a customer will take a personal Loan.

Model performance improvement Gridsearch for Hyperparameter tuning to reduce overfitting

In [ ]:
from sklearn.metrics import make_scorer, recall_score # Import make_scorer and recall_score

#choose the type of classifier
estimator = DecisionTreeClassifier(random_state=42)

#model = DecisionTreeClassifier(random_state=42

#grid of patameter to choose from
parameters = {
    "max_depth": [5, 10, 15, None],
    "criterion": ["entropy", "gini"],
    "splitter": ["best", "random"],
    "min_impurity_decrease": [0.00001, 0.0001, 0.01],
}

# Type of scoring used to compare parameter combinations
scorer = make_scorer(recall_score)

# Run the grid search
import warnings
warnings.filterwarnings("ignore")
from sklearn.model_selection import GridSearchCV
grid_object = GridSearchCV(estimator, parameters, scoring=scorer, cv=5)
df_grid = grid_object.fit(X_train, y_train)
# The error was in the next line, grid_obj was not defined,
# it should be grid_object
grid_object = grid_object.fit(X_train, y_train)

# Set the clf to the best combination of parameters
estimator = grid_object.best_estimator_

# Fit the best algorithm to the data
estimator.fit(X_train, y_train)
DecisionTreeClassifier(class_weight={0: 0.85}, max_depth=5,
                       min_impurity_decrease=1e-05, random_state=42)
Out[ ]:
DecisionTreeClassifier(class_weight={0: 0.85}, max_depth=5,
                       min_impurity_decrease=1e-05, random_state=42)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
DecisionTreeClassifier(class_weight={0: 0.85}, max_depth=5,
                       min_impurity_decrease=1e-05, random_state=42)

Model performance on training set

In [ ]:
decision_tree_tune_performance_train = model_performance_classification(
    estimator, X_train, y_train
)
decision_tree_tune_perf_train
Out[ ]:
Accuracy Recall Precision F1
0 1.0 1.0 1.0 1.0
In [ ]:
def  confusion_matrix_sklearn(model, predictors, target):

    y_pred = model.predict(predictors)
    cm = confusion_matrix(target, y_pred)
    labels = np
    np.asarray(
        [
            ["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
            for item in cm.flatten()
        ]
    )
    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=labels, fmt="")
    plt.ylabel("True label")
    plt.xlabel("Predicted label")
    plt.show()
    return
    confusion_matrix_sklearn(estimator, X_train, y_train)
In [ ]:
# Model performance on test set

decision_tree_tune_perf_test = model_performance_classification(
    estimator, X_test, y_test
)
decision_tree_tune_perf_test
Out[ ]:
Accuracy Recall Precision F1
0 0.986 0.904762 0.959596 0.931373
In [ ]:
from sklearn.metrics import make_scorer, recall_score # Import make_scorer and recall_score

#choose the type of classifier
estimator = DecisionTreeClassifier(random_state=42)

#model = DecisionTreeClassifier(random_state=42

#grid of patameter to choose from
parameters = {
    "max_depth": [5, 10, 15, None],
    "criterion": ["entropy", "gini"],
    "splitter": ["best", "random"],
    "min_impurity_decrease": [0.00001, 0.0001, 0.01],
}

# Type of scoring used to compare parameter combinations
scorer = make_scorer(recall_score)

# Run the grid search
import warnings
warnings.filterwarnings("ignore")
from sklearn.model_selection import GridSearchCV
grid_obj = GridSearchCV(estimator, parameters, scoring=scorer, cv=5)

df_grid = grid_obj.fit(X_train, y_train)
In [ ]:
# Access the best estimator
best_estimator = grid_obj.best_estimator_
def confusion_matrix_sklearn(model, predictors, target):
    """
    Generates and displays a confusion matrix using Seaborn's heatmap.

    Args:
        model: The trained classification model.
        predictors: The predictor variables (X).
        target: The target variable (y).
    """
    y_pred = model.predict(predictors)
    cm = confusion_matrix(target, y_pred)

    # Reshape labels to match cm shape
    labels = np.asarray(
        [
            ["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
            for item in cm.flatten()
        ]
    ).reshape(cm.shape) # Reshape labels to match cm's shape

    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=labels, fmt="")
    plt.ylabel("True label")
    plt.xlabel("Predicted label")
    plt.show()
    return

confusion_matrix_sklearn(best_estimator, X_test, y_test)
No description has been provided for this image

Visualizing the decision tree¶

In [ ]:
plt.figure(figsize=(12, 8))
# Pass the best_estimator instead of estimator
out = tree.plot_tree(
    best_estimator,  # Changed to best_estimator
    feature_names=feature_names,
    filled=True,
    fontsize=9,
    node_ids=False,
    class_names=None,
)

# below code will add arrows to the decision tree split if they are missing
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor("black")
        arrow.set_linewidth(1)
plt.show()
No description has been provided for this image
In [ ]:
# Test report showing the rules of a decision tree
print(tree.export_text(best_estimator, feature_names=feature_names, show_weights=True))
|--- Income <= 113.50
|   |--- CCAvg <= 2.95
|   |   |--- Income <= 106.50
|   |   |   |--- weights: [2892.00, 0.00] class: 0
|   |   |--- Income >  106.50
|   |   |   |--- County_Santa Barbara County <= 0.50
|   |   |   |   |--- Education <= 1.50
|   |   |   |   |   |--- Family <= 3.50
|   |   |   |   |   |   |--- weights: [35.00, 0.00] class: 0
|   |   |   |   |   |--- Family >  3.50
|   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |--- Education >  1.50
|   |   |   |   |   |--- County_Other County <= 0.50
|   |   |   |   |   |   |--- County_Ventura County <= 0.50
|   |   |   |   |   |   |   |--- CCAvg <= 2.45
|   |   |   |   |   |   |   |   |--- ID <= 1276.50
|   |   |   |   |   |   |   |   |   |--- weights: [12.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- ID >  1276.50
|   |   |   |   |   |   |   |   |   |--- ZIPCode <= 90069.00
|   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |   |   |   |--- ZIPCode >  90069.00
|   |   |   |   |   |   |   |   |   |   |--- weights: [20.00, 4.00] class: 0
|   |   |   |   |   |   |   |--- CCAvg >  2.45
|   |   |   |   |   |   |   |   |--- Family <= 1.50
|   |   |   |   |   |   |   |   |   |--- weights: [2.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- Family >  1.50
|   |   |   |   |   |   |   |   |   |--- weights: [0.00, 2.00] class: 1
|   |   |   |   |   |   |--- County_Ventura County >  0.50
|   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |--- County_Other County >  0.50
|   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |--- County_Santa Barbara County >  0.50
|   |   |   |   |--- Mortgage_Yes <= 0.50
|   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |   |--- Mortgage_Yes >  0.50
|   |   |   |   |   |--- weights: [0.00, 2.00] class: 1
|   |--- CCAvg >  2.95
|   |   |--- CD_Account <= 0.50
|   |   |   |--- Income <= 82.50
|   |   |   |   |--- Age <= 28.00
|   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |--- Age >  28.00
|   |   |   |   |   |--- Experience <= 8.50
|   |   |   |   |   |   |--- Family <= 3.00
|   |   |   |   |   |   |   |--- weights: [0.00, 2.00] class: 1
|   |   |   |   |   |   |--- Family >  3.00
|   |   |   |   |   |   |   |--- weights: [7.00, 0.00] class: 0
|   |   |   |   |   |--- Experience >  8.50
|   |   |   |   |   |   |--- ZIPCode <= 94657.50
|   |   |   |   |   |   |   |--- ID <= 1071.50
|   |   |   |   |   |   |   |   |--- ID <= 1030.50
|   |   |   |   |   |   |   |   |   |--- weights: [12.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- ID >  1030.50
|   |   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |   |--- ID >  1071.50
|   |   |   |   |   |   |   |   |--- weights: [50.00, 0.00] class: 0
|   |   |   |   |   |   |--- ZIPCode >  94657.50
|   |   |   |   |   |   |   |--- ZIPCode <= 94714.50
|   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |   |--- ZIPCode >  94714.50
|   |   |   |   |   |   |   |   |--- CCAvg <= 3.10
|   |   |   |   |   |   |   |   |   |--- Education <= 2.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [3.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |   |--- Education >  2.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |   |   |--- CCAvg >  3.10
|   |   |   |   |   |   |   |   |   |--- weights: [13.00, 0.00] class: 0
|   |   |   |--- Income >  82.50
|   |   |   |   |--- Family <= 2.50
|   |   |   |   |   |--- Experience <= 33.50
|   |   |   |   |   |   |--- County_Riverside County <= 0.50
|   |   |   |   |   |   |   |--- Experience <= 3.50
|   |   |   |   |   |   |   |   |--- ID <= 2248.00
|   |   |   |   |   |   |   |   |   |--- weights: [2.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- ID >  2248.00
|   |   |   |   |   |   |   |   |   |--- weights: [0.00, 3.00] class: 1
|   |   |   |   |   |   |   |--- Experience >  3.50
|   |   |   |   |   |   |   |   |--- County_Sacramento County <= 0.50
|   |   |   |   |   |   |   |   |   |--- Income <= 110.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [63.00, 1.00] class: 0
|   |   |   |   |   |   |   |   |   |--- Income >  110.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [8.00, 2.00] class: 0
|   |   |   |   |   |   |   |   |--- County_Sacramento County >  0.50
|   |   |   |   |   |   |   |   |   |--- Family <= 1.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |   |   |   |--- Family >  1.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |   |   |   |--- County_Riverside County >  0.50
|   |   |   |   |   |   |   |--- weights: [0.00, 2.00] class: 1
|   |   |   |   |   |--- Experience >  33.50
|   |   |   |   |   |   |--- Education <= 1.50
|   |   |   |   |   |   |   |--- ZIPCode <= 92321.50
|   |   |   |   |   |   |   |   |--- ID <= 1537.50
|   |   |   |   |   |   |   |   |   |--- weights: [0.00, 2.00] class: 1
|   |   |   |   |   |   |   |   |--- ID >  1537.50
|   |   |   |   |   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |   |   |   |   |--- ZIPCode >  92321.50
|   |   |   |   |   |   |   |   |--- weights: [5.00, 0.00] class: 0
|   |   |   |   |   |   |--- Education >  1.50
|   |   |   |   |   |   |   |--- weights: [0.00, 4.00] class: 1
|   |   |   |   |--- Family >  2.50
|   |   |   |   |   |--- Age <= 57.00
|   |   |   |   |   |   |--- Online <= 0.50
|   |   |   |   |   |   |   |--- weights: [0.00, 13.00] class: 1
|   |   |   |   |   |   |--- Online >  0.50
|   |   |   |   |   |   |   |--- ID <= 2630.00
|   |   |   |   |   |   |   |   |--- Income <= 102.00
|   |   |   |   |   |   |   |   |   |--- weights: [7.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- Income >  102.00
|   |   |   |   |   |   |   |   |   |--- weights: [0.00, 2.00] class: 1
|   |   |   |   |   |   |   |--- ID >  2630.00
|   |   |   |   |   |   |   |   |--- weights: [0.00, 4.00] class: 1
|   |   |   |   |   |--- Age >  57.00
|   |   |   |   |   |   |--- CCAvg <= 3.20
|   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |--- CCAvg >  3.20
|   |   |   |   |   |   |   |--- ZIPCode <= 91383.00
|   |   |   |   |   |   |   |   |--- Experience <= 34.00
|   |   |   |   |   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- Experience >  34.00
|   |   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |   |--- ZIPCode >  91383.00
|   |   |   |   |   |   |   |   |--- weights: [8.00, 0.00] class: 0
|   |   |--- CD_Account >  0.50
|   |   |   |--- CCAvg <= 3.85
|   |   |   |   |--- weights: [0.00, 8.00] class: 1
|   |   |   |--- CCAvg >  3.85
|   |   |   |   |--- CCAvg <= 4.05
|   |   |   |   |   |--- weights: [3.00, 0.00] class: 0
|   |   |   |   |--- CCAvg >  4.05
|   |   |   |   |   |--- ID <= 2505.50
|   |   |   |   |   |   |--- CreditCard <= 0.50
|   |   |   |   |   |   |   |--- ID <= 2348.00
|   |   |   |   |   |   |   |   |--- weights: [0.00, 2.00] class: 1
|   |   |   |   |   |   |   |--- ID >  2348.00
|   |   |   |   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |   |   |   |--- CreditCard >  0.50
|   |   |   |   |   |   |   |--- weights: [2.00, 0.00] class: 0
|   |   |   |   |   |--- ID >  2505.50
|   |   |   |   |   |   |--- weights: [0.00, 4.00] class: 1
|--- Income >  113.50
|   |--- Education <= 1.50
|   |   |--- Family <= 2.50
|   |   |   |--- weights: [463.00, 0.00] class: 0
|   |   |--- Family >  2.50
|   |   |   |--- weights: [0.00, 59.00] class: 1
|   |--- Education >  1.50
|   |   |--- Income <= 116.50
|   |   |   |--- Experience <= 32.00
|   |   |   |   |--- CCAvg <= 2.80
|   |   |   |   |   |--- Experience <= 18.00
|   |   |   |   |   |   |--- Online <= 0.50
|   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |--- Online >  0.50
|   |   |   |   |   |   |   |--- weights: [6.00, 0.00] class: 0
|   |   |   |   |   |--- Experience >  18.00
|   |   |   |   |   |   |--- ID <= 2052.00
|   |   |   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |   |   |   |--- ID >  2052.00
|   |   |   |   |   |   |   |--- weights: [0.00, 4.00] class: 1
|   |   |   |   |--- CCAvg >  2.80
|   |   |   |   |   |--- weights: [0.00, 6.00] class: 1
|   |   |   |--- Experience >  32.00
|   |   |   |   |--- weights: [6.00, 0.00] class: 0
|   |   |--- Income >  116.50
|   |   |   |--- weights: [0.00, 237.00] class: 1

In [ ]:
# Gini importance of features in the tree building

print(
    pd.DataFrame(
        best_estimator.feature_importances_, columns=["Imp"], index=X_train.columns
    ).sort_values(by="Imp", ascending=False)
)
                                   Imp
Education                     0.373863
Income                        0.296189
Family                        0.182161
CCAvg                         0.049369
ID                            0.021313
Experience                    0.020157
CD_Account                    0.014389
Age                           0.009617
ZIPCode                       0.008576
Online                        0.008211
County_Riverside County       0.004878
County_Santa Barbara County   0.002525
County_Ventura County         0.002010
Mortgage_Yes                  0.001996
County_Other County           0.001917
CreditCard                    0.001597
County_Sacramento County      0.001231
County_Santa Cruz County      0.000000
County_Santa Clara County     0.000000
County_San Mateo County       0.000000
County_San Francisco County   0.000000
County_San Diego County       0.000000
County_San Bernardino County  0.000000
County_Kern County            0.000000
County_Orange County          0.000000
County_Monterey County        0.000000
County_Marin County           0.000000
County_Los Angeles County     0.000000
County_Contra Costa County    0.000000
County_Alameda County         0.000000
Mortgage_No                   0.000000
Securities_Account            0.000000
County_Yolo County            0.000000

Observations: In tuned decision tree, Education and Income are the most important features, followed by Family size.

In [ ]:
import matplotlib.pyplot as plt
from sklearn import tree

# Create a figure
fig = plt.subplots(figsize=(12, 8))

# Plot the decision tree
out = tree.plot_tree(
    best_estimator,
    feature_names=feature_names,
    filled=True,
    fontsize=9,
    node_ids=False,
    class_names=None,
)
No description has been provided for this image
In [ ]:
pd.DataFrame(
        best_estimator.feature_importances_, columns=["Imp"], index=X_train.columns
    ).sort_values(by="Imp", ascending=False)
Out[ ]:
Imp
Education 0.373863
Income 0.296189
Family 0.182161
CCAvg 0.049369
ID 0.021313
Experience 0.020157
CD_Account 0.014389
Age 0.009617
ZIPCode 0.008576
Online 0.008211
County_Riverside County 0.004878
County_Santa Barbara County 0.002525
County_Ventura County 0.002010
Mortgage_Yes 0.001996
County_Other County 0.001917
CreditCard 0.001597
County_Sacramento County 0.001231
County_Santa Cruz County 0.000000
County_Santa Clara County 0.000000
County_San Mateo County 0.000000
County_San Francisco County 0.000000
County_San Diego County 0.000000
County_San Bernardino County 0.000000
County_Kern County 0.000000
County_Orange County 0.000000
County_Monterey County 0.000000
County_Marin County 0.000000
County_Los Angeles County 0.000000
County_Contra Costa County 0.000000
County_Alameda County 0.000000
Mortgage_No 0.000000
Securities_Account 0.000000
County_Yolo County 0.000000

The best estimator feature importances value for all features shows in the data is 0.

In [ ]:
!pip install scikit-learn matplotlib

import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

# Assuming you have X_train, y_train from previous steps

# Create a decision tree classifier
clf = DecisionTreeClassifier(random_state=42)

# Get the pruning path
path = clf.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas, impurities = path.ccp_alphas, path.impurities

# Plot total impurity vs effective alpha
plt.figure(figsize=(10, 6))
plt.plot(ccp_alphas[:-1], impurities[:-1], marker='o', drawstyle="steps-post")
plt.xlabel("Effective Alpha")
plt.ylabel("Total Impurity of Leaves")
plt.title("Total Impurity vs Effective Alpha for Training Set")
plt.show()
Requirement already satisfied: scikit-learn in /usr/local/lib/python3.10/dist-packages (1.5.2)
Requirement already satisfied: matplotlib in /usr/local/lib/python3.10/dist-packages (3.7.1)
Requirement already satisfied: numpy>=1.19.5 in /usr/local/lib/python3.10/dist-packages (from scikit-learn) (1.26.4)
Requirement already satisfied: scipy>=1.6.0 in /usr/local/lib/python3.10/dist-packages (from scikit-learn) (1.13.1)
Requirement already satisfied: joblib>=1.2.0 in /usr/local/lib/python3.10/dist-packages (from scikit-learn) (1.4.2)
Requirement already satisfied: threadpoolctl>=3.1.0 in /usr/local/lib/python3.10/dist-packages (from scikit-learn) (3.5.0)
Requirement already satisfied: contourpy>=1.0.1 in /usr/local/lib/python3.10/dist-packages (from matplotlib) (1.3.0)
Requirement already satisfied: cycler>=0.10 in /usr/local/lib/python3.10/dist-packages (from matplotlib) (0.12.1)
Requirement already satisfied: fonttools>=4.22.0 in /usr/local/lib/python3.10/dist-packages (from matplotlib) (4.53.1)
Requirement already satisfied: kiwisolver>=1.0.1 in /usr/local/lib/python3.10/dist-packages (from matplotlib) (1.4.7)
Requirement already satisfied: packaging>=20.0 in /usr/local/lib/python3.10/dist-packages (from matplotlib) (24.1)
Requirement already satisfied: pillow>=6.2.0 in /usr/local/lib/python3.10/dist-packages (from matplotlib) (10.4.0)
Requirement already satisfied: pyparsing>=2.3.1 in /usr/local/lib/python3.10/dist-packages (from matplotlib) (3.1.4)
Requirement already satisfied: python-dateutil>=2.7 in /usr/local/lib/python3.10/dist-packages (from matplotlib) (2.8.2)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.10/dist-packages (from python-dateutil>=2.7->matplotlib) (1.16.0)
No description has been provided for this image

We need to train the decision tree using the effective alphas. The last value in the alphas is the alpha value that prunes the whole tree, leaving the tree clfs[-1], with one node

In [ ]:
# Training the decision tree using the effective alphas

clfs = []

for ccp_alpha in ccp_alphas:
    clf = DecisionTreeClassifier(
        random_state=1, ccp_alpha=ccp_alpha, class_weight={0: 0.15, 1: 0.85}
    )
    clf.fit(X_train, y_train)
    clfs.append(clf)

print(
    "Number of nodes in the last tree is : {} with ccp_alpha : {}".format(
        clfs[-1].tree_.node_count, ccp_alphas[-1]
    )
)
Number of nodes in the last tree is : 7 with ccp_alpha : 0.05214930148943829
In [ ]:
clfs = clfs[:-1]
ccp_alphas = ccp_alphas[  :-1]
In [ ]:
recall_train = []

for clf in clfs:
    pred_train = clf.predict(X_train)
    values_train = recall_score(y_train, pred_train)
    recall_train.append(values_train)
In [ ]:
recall_test = []

for clf in clfs:
    pred_test = clf.predict(X_test)
    values_test = recall_score(y_test, pred_test)
    recall_test.append(values_test)
In [ ]:
train_scores = [clf.score(X_train, y_train) for clf in clfs]
test_scores = [clf.score(X_test, y_test) for clf in clfs]
In [ ]:
fig, ax = plt.subplots(figsize=(15, 5))
ax.set_xlabel("alpha")
ax.set_ylabel("Recall")
ax.set_title("Recall vs alpha for training and testing sets")
ax.plot(ccp_alphas, recall_train, marker="o", label="train", drawstyle="steps-post")
ax.plot(ccp_alphas, recall_test, marker="o", label="test", drawstyle="steps-post")
ax.legend()
plt.show()
No description has been provided for this image

The maximum value of Recall is around 0.01 alpha, for both train and test sets

In [ ]:
# Creating the model where we get highest train and test recall

index_best_model = np.argmax(recall_test)
best_model = clfs[index_best_model]
print(best_model)
DecisionTreeClassifier(ccp_alpha=0.005758202713680605,
                       class_weight={0: 0.15, 1: 0.85}, random_state=1)
In [ ]:
best_model.fit(X_train, y_train)
Out[ ]:
DecisionTreeClassifier(ccp_alpha=0.005758202713680605,
                       class_weight={0: 0.15, 1: 0.85}, random_state=1)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
DecisionTreeClassifier(ccp_alpha=0.005758202713680605,
                       class_weight={0: 0.15, 1: 0.85}, random_state=1)

Model performance on training set¶

In [ ]:
decision_tree_postpruned_perf_train = model_performance_classification(
    best_model, X_train, y_train
)
decision_tree_postpruned_perf_train
Out[ ]:
Accuracy Recall Precision F1
0 0.93525 0.992 0.592357 0.741775
In [ ]:
confusion_matrix_sklearn(best_model, X_train, y_train)
No description has been provided for this image

Model performance on test set¶

In [ ]:
decision_tree_postpruned_perf_test = model_performance_classification(
    best_model, X_test, y_test
)
decision_tree_postpruned_perf_test
Out[ ]:
Accuracy Recall Precision F1
0 0.941 0.980952 0.64375 0.777358
In [ ]:
confusion_matrix_sklearn(best_model, X_test, y_test)
No description has been provided for this image

Observations

  • The results have improved from the initial model
  • The model is giving a generalized performance on training and test set
In [ ]:
#Visualizing the decision tree

plt.figure(figsize=(10, 8))

out = tree.plot_tree(
    best_model,
    feature_names=feature_names,
    filled=True,
    fontsize=9,
    node_ids=True,
    class_names=True,
)

for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor("black")
        arrow.set_linewidth(1)
plt.show()
plt.show()
No description has been provided for this image
In [ ]:
# Text report showing the rules of a decision tree

print(tree.export_text(best_model, feature_names=feature_names, show_weights=True))
|--- Income <= 98.50
|   |--- CCAvg <= 2.95
|   |   |--- weights: [424.80, 0.00] class: 0
|   |--- CCAvg >  2.95
|   |   |--- weights: [21.15, 25.50] class: 1
|--- Income >  98.50
|   |--- Education <= 1.50
|   |   |--- Family <= 2.50
|   |   |   |--- weights: [80.55, 2.55] class: 0
|   |   |--- Family >  2.50
|   |   |   |--- weights: [2.85, 54.40] class: 1
|   |--- Education >  1.50
|   |   |--- Income <= 114.50
|   |   |   |--- weights: [13.65, 28.05] class: 1
|   |   |--- Income >  114.50
|   |   |   |--- weights: [0.75, 208.25] class: 1

In [ ]:
# Gini importance of features in the tree building

print(
    pd.DataFrame(
        best_model.feature_importances_, columns=["Imp"], index=X_train.columns
    ).sort_values(by="Imp", ascending=False)
)
                                   Imp
Income                        0.614717
Family                        0.164455
Education                     0.148762
CCAvg                         0.072066
ID                            0.000000
County_San Francisco County   0.000000
County_Other County           0.000000
County_Riverside County       0.000000
County_Sacramento County      0.000000
County_San Bernardino County  0.000000
County_San Diego County       0.000000
County_Santa Barbara County   0.000000
County_San Mateo County       0.000000
County_Monterey County        0.000000
County_Santa Clara County     0.000000
County_Santa Cruz County      0.000000
County_Ventura County         0.000000
County_Orange County          0.000000
County_Kern County            0.000000
County_Marin County           0.000000
County_Los Angeles County     0.000000
Age                           0.000000
County_Contra Costa County    0.000000
County_Alameda County         0.000000
Mortgage_Yes                  0.000000
Mortgage_No                   0.000000
CreditCard                    0.000000
Online                        0.000000
CD_Account                    0.000000
Securities_Account            0.000000
ZIPCode                       0.000000
Experience                    0.000000
County_Yolo County            0.000000
In [ ]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn import tree

# Assuming 'best_model', 'X_train', and 'feature_names' are already defined

importances = best_model.feature_importances_
indices = np.argsort(importances)

# Instead of calling the undefined 'plot_feature_importances',
# we directly create the plot using matplotlib:
plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
No description has been provided for this image

Observations

Income is the most important feature to predict if the customer will borrow a personal loan Followed by Family size, Education and the CCAvg.

In [ ]:
# Training performance comparison

models_train_comp_df = pd.concat(
    [
        decision_tree_perf_train.T,
        decision_tree_tune_perf_train.T,
        decision_tree_postpruned_perf_train.T,
    ],
    axis=1,
)
models_train_comp_df.columns = [
    "Decision Tree Sklearn",
    "Decision Tree (Pre-Pruning)",
    "Decision Tree (Post-Pruning)",
]
print("Training performance comparison : ")
models_train_comp_df
Training performance comparison : 
Out[ ]:
Decision Tree Sklearn Decision Tree (Pre-Pruning) Decision Tree (Post-Pruning)
Accuracy 1.0 1.0 0.935250
Recall 1.0 1.0 0.992000
Precision 1.0 1.0 0.592357
F1 1.0 1.0 0.741775

Model Performance Comparison and Final Model Selection¶

In [ ]:
# Test performance comparison

models_test_comp_df = pd.concat(
    [
        decision_tree_perf_test.T,
        decision_tree_tune_perf_test.T,
        decision_tree_postpruned_perf_test.T, # Fixed: This line was split and incorrectly indented
    ],
    axis=1,
)

models_test_comp_df.columns = [
    "Decision Tree Sklearn",
    "Decision Tree (Pre-Pruning)",
    "Decision Tree (Post-Pruning)",
]
print("Test performance comparison : ")
models_test_comp_df
Test performance comparison : 
Out[ ]:
Decision Tree Sklearn Decision Tree (Pre-Pruning) Decision Tree (Post-Pruning)
Accuracy 0.982000 0.986000 0.941000
Recall 0.877358 0.904762 0.980952
Precision 0.948980 0.959596 0.643750
F1 0.911765 0.931373 0.777358

Conclusion Decision tree model with pre-pruning has given the best recall score on training data. Decision tree model with post-pruning has given the best recall score on test set. The tree with post pruning is not complex and is easy to interpret.

Logistic Regression and Decision Tree¶

Model comparison¶

Logistic Regression

Outliers were treated Variables with multicollinearity and a p>0.05 were dropped in the process of model building Highest accuracy in post pruning is 0.99 and recall is 0.91. Able to interpret that a unit increase in any of the variables (Income, Family, CCAvg, CD_Account, Education levels, Riverside County) will increase the odds of a customer taking a personal loan.

Decision Tree

Variables with outliers and multicollinearity were excluded as they did not impact the Decision Tree. The highest recall achieved on the test set was 0.99 using a post-pruned decision tree, yielding consistent results across train and test sets. Notably, Income, Family, Education, and CCAvg emerged as key predictors of a customer's likelihood to borrow a personal loan.

Actionable Insights and Business Recommendations¶

Recommendation¶

The marketing team of the bank should target customers with the following:

  1. The bank's marketing efforts should concentrate on high-value client characterized by annual incomes over 120,000 graduate or advanced/professional education,family siced of three or more,existing certificate of deposit accounts,and substntial credit cards pending(above $2500/month).
  2. Additional targeting criteria include mortgage holders, individuals aged 35+, those with 9+ years of experience, and residents of LA, Santa Clara, San Diego, and Riverside Counties for enhanced effectiveness.